Two bugs prevented historical secondary school data from loading:
1. stg_ees_ks4.sql filtered breakdown_topic = 'Total' only, but EES
releases prior to 2023/24 use breakdown_topic = 'All pupils' (matching
the KS2 convention). All older years were silently dropped to zero rows.
Fix: accept both values with an IN clause.
2. get_all_releases() in tap-uk-ees fetched only the first page of the
EES releases API. It now follows every page using the paging.totalPages
field, so no historical release is missed when more than 20 exist.
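A minimal sketch of the page-following loop described in item 2. Only paging.totalPages is named in this commit; the "results" key and the fetch_page callable (a stand-in for the HTTP request) are assumptions made for illustration, not the tap's actual code.

```python
from typing import Any, Callable, Dict, List


def get_all_releases(fetch_page: Callable[[int], Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect releases from every page rather than only the first.

    fetch_page is a hypothetical callable returning the decoded JSON body
    for a 1-based page number. The 'results' key is an assumption.
    """
    releases: List[Dict[str, Any]] = []
    page = 1
    while True:
        body = fetch_page(page)
        releases.extend(body.get("results", []))
        # Stop once the current page is the last one the API reports.
        total_pages = body.get("paging", {}).get("totalPages", 1)
        if page >= total_pages:
            break
        page += 1
    return releases


def fake_fetch(page: int) -> Dict[str, Any]:
    """Stand-in for the API: three pages with two releases each."""
    return {
        "results": [{"id": f"r{page}-{i}"} for i in range(2)],
        "paging": {"page": page, "totalPages": 3},
    }
```

Calling get_all_releases(fake_fetch) walks all three fake pages. In the tap, fetch_page would wrap the real request.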
After re-running the annual EES pipeline, secondary school comparison
charts should show data across all available years (2018/19 onwards).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces computed means from our school dataset with the published DfE
national headline figures for the KS2 chart reference line.
- tap-uk-ees: new EESKs2NationalStream fetches the stable EES data-catalogue
CSV (one row per year, England national total, AllSchools filter)
- dbt staging: stg_ees_ks2_national normalises columns, casts to float,
filters to years >= 201617
- dbt mart: fact_ks2_national_averages — one row per year, official figures
- backend/models: Ks2NationalAverage SQLAlchemy model
- backend/app: /api/national-averages queries the mart for KS2 by_year;
secondary by_year stays computed (no DfE KS4 national dataset yet)
- DAG: extract_ks2_national task added to school_data_annual_ees,
runs in parallel with the main EES extract
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Expand the abbreviation in metric names (backend schemas), the home page
sort dropdown, README/QA docs, and pipeline comments. Short_name fields
and the compact row/map-card labels remain abbreviated for space.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Old DfE CSVs encode percentages as "57%" not "57". The safe_numeric
macro rejects non-numeric strings, so strip the suffix before emitting.
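A small sketch of the suffix strip, assuming values arrive as strings or None. The function name strip_percent is hypothetical. Non-numeric markers are passed through unchanged so that safe_numeric can still reject them downstream.

```python
from typing import Optional


def strip_percent(value: Optional[str]) -> Optional[str]:
    """Turn '57%' into '57'; leave every other value as it was (trimmed)."""
    if value is None:
        return None
    cleaned = value.strip()
    if cleaned.endswith("%"):
        # Drop the trailing percent sign and any space before it.
        cleaned = cleaned[:-1].strip()
    return cleaned
```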
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The file hosting uses non-deterministic URLs, so replace legacy_ks2_base_url
+ legacy_ks2_years with a single legacy_ks2_urls object mapping year codes
to download URLs. Configure the 4 pre-COVID years in meltano.yml.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add LegacyKS2Stream to tap-uk-ees: downloads old DfE england_ks2final.csv
files from a configurable base URL, maps 318-column wide format to the
same schema as stg_ees_ks2 output
- Add stg_legacy_ks2.sql staging model with safe_numeric casts
- Add legacy_ks2 source to _stg_sources.yml
- Update int_ks2_with_lineage.sql to union EES + legacy data
- Configurable via legacy_ks2_base_url and legacy_ks2_years tap settings
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Older census CSVs use 'URN' (uppercase) while the stream expects 'urn'.
Normalise the column name before filtering and emitting records.
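One way the normalisation could look, sketched over plain dict rows. The tap may well do this on a DataFrame instead; the helper name normalise_urn_column is hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator


def normalise_urn_column(rows: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Rename an uppercase 'URN' key to 'urn' so later filtering sees one name."""
    for row in rows:
        if "urn" not in row and "URN" in row:
            row = dict(row)  # copy so the caller's row is not mutated
            row["urn"] = row.pop("URN")
        yield row
```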
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Older census (and other) files don't include a time_period column.
Derive it from the release slug (e.g. '2022-23' → '202223') and inject
it into records so the required Singer schema field is always present.
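A sketch of the slug-to-period derivation, assuming slugs contain a four-digit start year and a two-digit end year separated by a hyphen, as in the '2022-23' example. Both function names are hypothetical.

```python
import re
from typing import Any, Dict, Optional

# Four-digit start year, hyphen, two-digit end year, e.g. '2022-23'.
_SLUG_YEAR = re.compile(r"(\d{4})-(\d{2})")


def time_period_from_slug(slug: str) -> Optional[str]:
    """Return e.g. '202223' for '2022-23', or None if no year pair is found."""
    match = _SLUG_YEAR.search(slug)
    if match is None:
        return None
    return match.group(1) + match.group(2)


def inject_time_period(record: Dict[str, Any], slug: str) -> Dict[str, Any]:
    """Fill time_period from the slug only when the record lacks one."""
    if record.get("time_period"):
        return record
    return {**record, "time_period": time_period_from_slug(slug)}
```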
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add get_all_release_ids() to paginate /publications/{slug}/releases and
iterate over every release in get_records(). Add latest_only config flag
(default false) to restore single-release behaviour for daily runs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The preamble row in Ofsted CSVs contains 'turn off all filters', which
satisfied the 'urn' in line.lower() check, so header_idx was set to 0
instead of the real header row. Use a regex that matches URN only as a CSV
field.
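An illustrative pattern for the field-only match. The exact regex in the tap is not shown in this commit, so the pattern and the find_header_index helper below are assumptions.

```python
import re
from typing import List, Optional

# URN must be a whole CSV field: at line start or after a comma, optionally
# quoted, and followed by a comma or the end of the line.
_URN_FIELD = re.compile(r'(?:^|,)\s*"?urn"?\s*(?:,|$)', re.IGNORECASE)


def find_header_index(lines: List[str]) -> Optional[int]:
    """Return the index of the first line that has URN as a standalone field."""
    for idx, line in enumerate(lines):
        if _URN_FIELD.search(line.strip()):
            return idx
    return None
```

A preamble such as 'To see all data, turn off all filters' no longer matches, because 'urn' there is part of the word 'turn' rather than a field of its own.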
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
tap-uk-ees: EESCensusStream now declares 27 data columns (FSM %, EAL %,
ethnicity breakdowns, pupil counts) with clean Singer field names mapped
from the verbose CSV column names (e.g. '% of pupils known to be eligible
for free school meals' → fsm_pct) via a new _column_renames mechanism on
the base stream class.
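A rough sketch of how a _column_renames mapping on a base class might be applied. Only the attribute name and the fsm_pct mapping come from this commit; the class names and the rename_columns method are hypothetical.

```python
from typing import Any, Dict


class BaseStream:
    """Hypothetical base class: subclasses override _column_renames."""

    _column_renames: Dict[str, str] = {}

    def rename_columns(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Keys without a rename entry pass through unchanged.
        return {self._column_renames.get(key, key): value for key, value in record.items()}


class CensusStream(BaseStream):
    _column_renames = {
        "% of pupils known to be eligible for free school meals": "fsm_pct",
    }
```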
stg_ees_census: materialised as table, applies safe_numeric to all
percentage/count columns, filters to numeric URNs.
int_pupil_chars_merged + fact_pupil_characteristics: pass all columns
through from staging (previously stubs with only 3 columns).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The admissions school-level file contains some rows with null school_urn
(LA/category aggregates that survive the geographic_level filter). These
cause a not-null constraint violation at target-postgres. Drop any row
where the URN column is null or empty before yielding records.
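A minimal sketch of the row filter over dict records. The helper name drop_missing_urn is hypothetical, and the NaN check is an assumption in case values come from a DataFrame.

```python
from typing import Any, Dict, Iterable, Iterator


def drop_missing_urn(
    rows: Iterable[Dict[str, Any]], urn_column: str = "school_urn"
) -> Iterator[Dict[str, Any]]:
    """Yield only rows whose URN is present and non-empty."""
    for row in rows:
        value = row.get(urn_column)
        if value is None:
            continue
        if isinstance(value, float) and value != value:  # NaN never equals itself
            continue
        if str(value).strip() == "":
            continue
        yield row
```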
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Admissions file is UTF-8 with BOM, not Latin-1. Reading as latin-1
decoded the BOM bytes as 'ï»¿', which wasn't stripped. Change admissions
encoding to utf-8-sig (strips BOM automatically). Also update the manual
BOM strip fallback to handle the latin-1 decoded form.
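The difference between the two decodings can be shown with the standard library alone. The sample bytes and the strip_bom helper are illustrative, not the tap's actual fallback.

```python
UTF8_BOM = "\ufeff"
# The BOM bytes EF BB BF as they appear when read as latin-1.
LATIN1_BOM = "\u00ef\u00bb\u00bf"

raw = b"\xef\xbb\xbfschool_urn,total\n100000,30\n"

as_latin1 = raw.decode("latin-1")      # BOM survives as three junk characters
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is consumed by the codec


def strip_bom(name: str) -> str:
    """Manual fallback: remove either form of the BOM from a column name."""
    for bom in (UTF8_BOM, LATIN1_BOM):
        if name.startswith(bom):
            return name[len(bom):]
    return name
```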
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Some DfE supporting-files CSVs have a UTF-8 BOM on the first column,
causing it to be named '\ufefftime_period' instead of 'time_period'.
This trips Singer schema validation ('time_period' is a required property).
Strip the BOM from all column names after read_csv.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DfE supporting-files CSVs (spc_school_level_underlying_data, AppsandOffers
SchoolLevel) are Latin-1 encoded. Add _encoding class attribute to base
stream class and override to 'latin-1' for census and admissions streams.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pipeline:
- EES tap: split KS4 into performance + info streams, fix admissions filename
(SchoolLevel keyword match), fix census filename (yearly suffix), remove
phonics (no school-level data on EES), change endswith → in for matching
- stg_ees_ks4: rewrite to filter long-format data and extract Attainment 8,
Progress 8, EBacc, English/Maths metrics; join KS4 info for context
- stg_ees_admissions: map real CSV columns (total_number_places_offered, etc.)
- stg_ees_census: update source reference, stub with TODO for data columns
- Remove stg_ees_phonics, fact_phonics (no school-level EES data)
- Add ees_ks4_performance + ees_ks4_info sources, remove ees_ks4 + ees_phonics
- Update int_ks4_with_lineage + fact_ks4_performance with new KS4 columns
- Annual EES DAG: remove stg_ees_phonics+ from selector
Backend:
- models.py: replace all models to point at marts.* tables with schema='marts'
(DimSchool, DimLocation, KS2Performance, FactOfstedInspection, etc.)
- data_loader.py: rewrite load_school_data_as_dataframe() using raw SQL joining
dim_school + dim_location + fact_ks2_performance; update get_supplementary_data()
- database.py: remove migration machinery, keep only connection setup
- app.py: remove check_and_migrate_if_needed, remove /api/admin/reimport-ks2
endpoints (pipeline handles all imports)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix publication slugs (KS4, Phonics, Admissions were wrong)
- Split KS2 into two streams: ees_ks2_attainment (long format) and
ees_ks2_info (wide format context data)
- Target specific filenames instead of keyword matching
- Handle school_urn vs urn column naming
- Pivot KS2 attainment from long to wide format in dbt staging
- Add all ~40 KS2 columns the backend needs (GPS, absence, gender,
disadvantaged breakdowns, context demographics)
- Pass through all columns in int_ks2_with_lineage and fact_ks2
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Declare all 34 columns needed by dbt in GIAS tap schema (target-postgres
only persists columns present in the Singer schema message)
- Use nullif() for empty-string-to-integer/date casts in staging models
- Scope daily DAG dbt build to GIAS models only (stg_gias_establishments+
stg_gias_links+) to avoid errors on unloaded sources
- Scope annual EES DAG similarly; remove redundant dbt test steps
- Make dim_school gracefully handle missing int_ofsted_latest table
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CSV is read with dtype=str so all values arrive as strings. Declaring
LA (code) and EstablishmentNumber as IntegerType caused schema
validation failures in target-postgres. Use StringType for all columns
except URN (which is explicitly cast to int for the primary key).
Type casting happens in dbt staging models.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port critical patterns from the working integrator into Singer taps:
- GIAS: add 404 fallback to yesterday's date, increase timeout to 300s,
use latin-1 encoding, use dated URL for links (static URL returns 500)
- FBIT: add GIAS date fallback, increase timeout, fix encoding to latin-1
- IDACI: use dated GIAS URL with fallback instead of undated static URL,
fix encoding to latin-1, increase timeout to 300s
- Ofsted: try utf-8-sig then fall back to latin-1 encoding
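A sketch of the dated-URL fallback used for GIAS above, under the assumption that a missing file surfaces as a 404. Here fetch is a hypothetical callable that returns the payload for a date, or None when that date's file does not exist; the function name and the max_days_back parameter are also assumptions.

```python
from datetime import date, timedelta
from typing import Callable, Optional, Tuple


def fetch_with_date_fallback(
    fetch: Callable[[date], Optional[bytes]],
    today: date,
    max_days_back: int = 1,
) -> Tuple[date, bytes]:
    """Try today's dated file first, then step back one day at a time."""
    for offset in range(max_days_back + 1):
        candidate = today - timedelta(days=offset)
        payload = fetch(candidate)
        if payload is not None:
            return candidate, payload
    raise FileNotFoundError(
        f"no file found for {today} or the {max_days_back} day(s) before it"
    )
```

With the default max_days_back=1 this tries today and then yesterday, matching the behaviour described in the GIAS bullet.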
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port extraction logic from integrator scripts into Singer SDK taps:
- tap-uk-parent-view: scrapes Ofsted open data portal, parses survey responses (14 questions)
- tap-uk-fbit: queries FBIT API per-URN with rate limiting, computes per-pupil spend
- tap-uk-idaci: downloads IoD2019 XLSX, batch-resolves postcodes→LSOAs via postcodes.io
Update dbt models to match actual tap output schemas:
- stg_idaci now includes URN (tap does the postcode→LSOA→school join)
- stg_parent_view expanded from 8 to 13 question columns
- fact_deprivation simplified (no longer needs postcode→LSOA join in dbt)
- fact_parent_view expanded to include all 13 question metrics
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>