fix(ees-tap): strip UTF-8 BOM from CSV column names

Some DfE supporting-files CSVs have a UTF-8 BOM on the first column,
causing it to be named '\ufefftime_period' instead of 'time_period'.
This trips Singer schema validation ('time_period' is a required property).
Strip the BOM from all column names after read_csv.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 09:54:15 +00:00
parent f4f0257447
commit b8ecc5c58b


@@ -83,6 +83,9 @@ class EESDatasetStream(Stream):
     with zf.open(target) as f:
         df = pd.read_csv(f, dtype=str, keep_default_na=False, encoding=self._encoding)
+    # Strip UTF-8 BOM from column names (some DfE files have a BOM on the first column)
+    df.columns = df.columns.str.lstrip("\ufeff")
     # Filter to school-level data if the column exists
     if "geographic_level" in df.columns:
         df = df[df["geographic_level"] == "School"]
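
Note: the failure mode described in the commit message can be reproduced without the tap. The sketch below is illustrative only and is not code from this repository: it uses the standard-library csv module rather than pandas, and the sample bytes and the strip_bom helper are hypothetical. It shows that decoding a BOM-prefixed file as plain "utf-8" leaves '\ufeff' on the first header, that stripping it restores 'time_period', and that decoding as "utf-8-sig" is another way to drop a leading BOM.

```python
import csv
import io

BOM = "\ufeff"


def strip_bom(names):
    """Return the column names with any leading BOM characters removed.

    Hypothetical helper for illustration; the commit itself uses
    df.columns.str.lstrip("\ufeff") on a pandas Index.
    """
    return [name.lstrip(BOM) for name in names]


# Hypothetical sample: a UTF-8 CSV whose first three bytes are the BOM (EF BB BF).
raw = b"\xef\xbb\xbftime_period,geographic_level\n202324,School\n"

# Decoding as plain "utf-8" keeps the BOM as the first character of the header row.
header = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# Stripping the BOM from every name is harmless for names that never had one.
cleaned = strip_bom(header)

# Decoding as "utf-8-sig" removes a leading BOM during decoding instead.
header_sig = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

print(header)
print(cleaned)
print(header_sig)
```

Stripping after the read, as this commit does, works whatever encoding the stream is configured with, whereas "utf-8-sig" only helps when the file is actually read as UTF-8.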