fix(ees-tap): fix BOM handling for admissions CSV
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 33s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m6s
Build and Push Docker Images / Build Integrator (push) Successful in 57s
Build and Push Docker Images / Build Kestra Init (push) Successful in 32s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m40s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
The admissions file is UTF-8 with a BOM, not Latin-1. Reading it as latin-1 decoded the three BOM bytes as 'ï»¿', which the existing strip (for \ufeff only) did not remove. Change the admissions encoding to utf-8-sig, which strips the BOM automatically, and update the manual BOM-strip fallback to also handle the latin-1 decoded form.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
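The encoding mismatch described in the commit message can be reproduced in isolation. The following is a minimal sketch, not code from this repository: the sample header bytes are made up for illustration (the column names are borrowed from the stream schema), and only Python's built-in codecs are used.

```python
# Hypothetical first line of a CSV file that starts with a UTF-8 BOM (bytes EF BB BF).
raw = b"\xef\xbb\xbftime_period,school_urn\n"

# latin-1 maps every byte to one character, so the BOM becomes three visible
# characters glued to the front of the first column name.
as_latin1 = raw.decode("latin-1")

# utf-8 decodes the BOM to the single code point U+FEFF and keeps it in the text.
as_utf8 = raw.decode("utf-8")

# utf-8-sig decodes the same way but drops a leading BOM if one is present.
as_utf8_sig = raw.decode("utf-8-sig")

for label, text in [("latin-1", as_latin1), ("utf-8", as_utf8), ("utf-8-sig", as_utf8_sig)]:
    first_column = text.split(",")[0]
    print(f"{label:10s} first column = {first_column!r}")
```

Only the utf-8-sig result yields a first column that compares equal to "time_period", which is why a lookup on that column name fails under the other two decodings unless the prefix is stripped afterwards.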
@@ -83,8 +83,13 @@ class EESDatasetStream(Stream):
 with zf.open(target) as f:
     df = pd.read_csv(f, dtype=str, keep_default_na=False, encoding=self._encoding)
 
-# Strip UTF-8 BOM from column names (some DfE files have a BOM on the first column)
-df.columns = df.columns.str.lstrip("\ufeff")
+# Strip BOM from first column name — handles both:
+# - UTF-8 BOM decoded as Unicode (\ufeff) when read with utf-8/utf-8-sig
+# - UTF-8 BOM bytes decoded as Latin-1 (ï»¿) when read with latin-1
+cols = list(df.columns)
+if cols:
+    cols[0] = cols[0].lstrip("\ufeff").lstrip("ï»¿")
+    df.columns = cols
 
 # Filter to school-level data if the column exists
 if "geographic_level" in df.columns:
@@ -292,7 +297,7 @@ class EESAdmissionsStream(EESDatasetStream):
 primary_keys = ["school_urn", "time_period"]
 _publication_slug = "primary-and-secondary-school-applications-and-offers"
 _target_filename = "SchoolLevel"
-_encoding = "latin-1"
+_encoding = "utf-8-sig"  # UTF-8 with BOM — sig variant strips the BOM automatically
 schema = th.PropertiesList(
     th.Property("time_period", th.StringType, required=True),
     th.Property("school_urn", th.StringType, required=True),
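One behaviour of the manual fallback is worth keeping in mind: `str.lstrip` treats its argument as a set of characters, not as a prefix, so stripping the latin-1 form removes any leading run of `ï`, `»`, or `¿`. For the column names in this dataset that is very likely harmless. The sketch below shows an exact-prefix alternative under that caveat. `strip_bom`, `BOM_FORMS`, and the sample column names are hypothetical and not part of this commit.

```python
# Both forms a UTF-8 BOM can take after decoding:
# U+FEFF (read as utf-8) and the three characters produced by a latin-1 misread.
BOM_FORMS = ("\ufeff", "\u00ef\u00bb\u00bf")


def strip_bom(name: str) -> str:
    """Remove one leading BOM, in either decoded form, as an exact prefix."""
    for bom in BOM_FORMS:
        if name.startswith(bom):
            return name[len(bom):]
    return name


polluted = "\u00ef\u00bb\u00bftime_period"  # first column after a latin-1 read
legit = "\u00bfactivo"                      # hypothetical name that really starts with '¿'

cleaned = strip_bom(polluted)                    # BOM prefix removed
kept = strip_bom(legit)                          # left untouched: no full BOM prefix
via_lstrip = legit.lstrip("\u00ef\u00bb\u00bf")  # character-set strip eats the '¿'

print(repr(cleaned), repr(kept), repr(via_lstrip))
```

Whether the extra strictness is worth it depends on the data; with `_encoding = "utf-8-sig"` in place, the fallback should rarely run for the admissions stream at all.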