Pipeline:
- EES tap: split KS4 into performance + info streams, fix admissions filename
(SchoolLevel keyword match), fix census filename (yearly suffix), remove
phonics (no school-level data on EES), match filenames by substring (in) instead of endswith
- stg_ees_ks4: rewrite to filter long-format data and extract Attainment 8,
Progress 8, EBacc, English/Maths metrics; join KS4 info for context
- stg_ees_admissions: map real CSV columns (total_number_places_offered, etc.)
- stg_ees_census: update source reference, stub with TODO for data columns
- Remove stg_ees_phonics, fact_phonics (no school-level EES data)
- Add ees_ks4_performance + ees_ks4_info sources, remove ees_ks4 + ees_phonics
- Update int_ks4_with_lineage + fact_ks4_performance with new KS4 columns
- Annual EES DAG: remove stg_ees_phonics+ from selector
Backend:
- models.py: repoint all models at marts.* tables with schema='marts'
(DimSchool, DimLocation, KS2Performance, FactOfstedInspection, etc.)
- data_loader.py: rewrite load_school_data_as_dataframe() using raw SQL joining
dim_school + dim_location + fact_ks2_performance; update get_supplementary_data()
- database.py: remove migration machinery, keep only connection setup
- app.py: remove check_and_migrate_if_needed, remove /api/admin/reimport-ks2
endpoints (pipeline handles all imports)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix publication slugs (KS4, Phonics, Admissions were wrong)
- Split KS2 into two streams: ees_ks2_attainment (long format) and
ees_ks2_info (wide format context data)
- Target specific filenames instead of keyword matching
- Handle school_urn vs urn column naming
- Pivot KS2 attainment from long to wide format in dbt staging
- Add all ~40 KS2 columns the backend needs (GPS, absence, gender,
disadvantaged breakdowns, context demographics)
- Pass through all columns in int_ks2_with_lineage and fact_ks2
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
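The long-to-wide pivot of KS2 attainment described above is done in dbt staging. As a rough illustration of the reshaping only, this Python sketch pivots long-format rows into one wide row per school and year. The field names `time_period`, `metric` and `value`, and the metric labels, are hypothetical stand-ins and not the real EES column names; only the `school_urn` vs `urn` handling comes from the notes above.

```python
from collections import defaultdict


def pivot_long_to_wide(rows):
    """Pivot long rows (one per school/year/metric) into one wide row per school/year."""
    wide = defaultdict(dict)
    for row in rows:
        # Some files name the key column school_urn, others urn.
        urn = row.get("school_urn") or row.get("urn")
        key = (urn, row["time_period"])
        wide[key][row["metric"]] = row["value"]
    return [
        {"urn": urn, "time_period": period, **metrics}
        for (urn, period), metrics in sorted(wide.items())
    ]


# Hypothetical sample rows, for illustration only.
long_rows = [
    {"school_urn": "100001", "time_period": "202324", "metric": "rwm_expected_pct", "value": "61"},
    {"school_urn": "100001", "time_period": "202324", "metric": "reading_avg_score", "value": "105"},
    {"urn": "100002", "time_period": "202324", "metric": "rwm_expected_pct", "value": "58"},
]
wide_rows = pivot_long_to_wide(long_rows)
```

Each distinct metric becomes a column, so a school with fewer metrics simply has fewer keys in its wide row.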
- Remove optional flag from total_pupils (Typesense requires default
sorting field to be non-optional)
- Add latitude/longitude columns to dim_location computed from PostGIS
geom, for direct use by backend and Typesense sync
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Typesense requires numeric default_sorting_field — use total_pupils
- Dynamically include KS2/KS4 joins only if those tables exist
- Extract lat/lng from PostGIS geom and populate Typesense geopoint field
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
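A minimal sketch of the "include KS2/KS4 joins only if those tables exist" idea from the commit above, assuming the caller has already looked up which marts tables exist (for example from information_schema). The function name and the columns `school_name`, `rwm_expected_pct` and `attainment_8_score` are hypothetical; the table names are the ones mentioned in these notes. The joins ignore year for brevity.

```python
def build_school_query(existing_tables):
    """Build the sync query, adding KS2/KS4 joins only for tables that exist."""
    select = ["s.urn", "s.school_name", "l.latitude", "l.longitude"]
    joins = ["JOIN marts.dim_location l ON l.urn = s.urn"]
    # table name -> (alias, example column); the columns are illustrative only
    optional = {
        "fact_ks2_performance": ("ks2", "ks2.rwm_expected_pct"),
        "fact_ks4_performance": ("ks4", "ks4.attainment_8_score"),
    }
    for table, (alias, column) in optional.items():
        if table in existing_tables:
            select.append(column)
            joins.append(f"LEFT JOIN marts.{table} {alias} ON {alias}.urn = s.urn")
    return f"SELECT {', '.join(select)} FROM marts.dim_school s " + " ".join(joins)


query = build_school_query({"fact_ks2_performance"})
```

With only the KS2 table present, the generated SQL contains the KS2 join and no reference to the KS4 table, so the sync does not fail before KS4 data has been loaded.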
PostGIS extension lives in public schema; marts schema can't resolve
unqualified ST_MakePoint/ST_Transform calls.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GIAS grid references are the actual school location — far more accurate
than postcode centroids. Remove geocode_postcodes.py from the daily DAG
and the postcode-not-null filter from dim_location.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Convert GIAS British National Grid coordinates (EPSG:27700) to WGS84
(EPSG:4326) directly in the dbt model. The geocode script backfills
schools missing easting/northing via Postcodes.io.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dbt default prepends the profile schema as prefix (public_staging,
public_marts). Override to use custom schema names directly (staging,
marts) so scripts can reference marts.dim_location correctly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Lineage map includes predecessor URNs for closed schools, which are
correctly excluded from dim_school (status = 'Open').
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GIAS CSV dates are DD-MM-YYYY format — use to_date() instead of cast().
Exclude int_ks2_with_lineage+ and int_ks4_with_lineage+ from daily DAG
selector since they depend on EES data not yet loaded.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Declare all 34 columns needed by dbt in GIAS tap schema (target-postgres
only persists columns present in the Singer schema message)
- Use nullif() for empty-string-to-integer/date casts in staging models
- Scope daily DAG dbt build to GIAS models only (stg_gias_establishments+
stg_gias_links+) to avoid errors on unloaded sources
- Scope annual EES DAG similarly; remove redundant dbt test steps
- Make dim_school gracefully handle missing int_ofsted_latest table
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
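The staging fixes above (to_date() for DD-MM-YYYY dates, nullif() before integer/date casts) live in SQL. This Python sketch only mirrors their intended semantics to show why both are needed; the helper names are hypothetical.

```python
from datetime import date, datetime


def parse_gias_date(raw):
    """Roughly mirrors to_date(nullif(raw, ''), 'DD-MM-YYYY'): empty string -> None."""
    if raw is None or raw == "":
        return None
    return datetime.strptime(raw, "%d-%m-%Y").date()


def parse_int(raw):
    """Roughly mirrors cast(nullif(raw, '') as integer): empty string -> None."""
    if raw is None or raw == "":
        return None
    return int(raw)


open_date = parse_gias_date("01-09-2015")
missing_date = parse_gias_date("")
```

A plain cast fails on both cases: it does not know the day-first layout, and an empty string is neither a valid date nor a valid integer.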
GIAS tap emits uppercase URN column — add quote: true so dbt source tests
reference "URN" instead of urn. Remove source-level tests from tables not yet
loaded (ofsted, ees, parent_view, fbit, idaci) to prevent relation-not-found
errors during dbt build.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CSV is read with dtype=str so all values arrive as strings. Declaring
LA (code) and EstablishmentNumber as IntegerType caused schema
validation failures in target-postgres. Use StringType for all columns
except URN (which is explicitly cast to int for the primary key).
Type casting happens in dbt staging models.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
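A stdlib-only sketch of the "everything is a string except URN" rule described above. The actual tap declares its schema with Singer SDK types (StringType, IntegerType); here a plain JSON-schema dict stands in for that, and the helper names and sample row are hypothetical.

```python
import csv
import io


def build_schema(columns, integer_columns=("URN",)):
    """Declare every column as a nullable string, except the integer key columns."""
    properties = {}
    for name in columns:
        if name in integer_columns:
            properties[name] = {"type": "integer"}
        else:
            properties[name] = {"type": ["string", "null"]}
    return {"type": "object", "properties": properties}


def read_records(csv_text):
    """Yield rows with all values as strings, casting only URN to int."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        row["URN"] = int(row["URN"])  # primary key; other casts happen in dbt
        yield row


# Hypothetical sample data, for illustration only.
sample = "URN,LA (code),EstablishmentNumber\n100001,201,3614\n"
schema = build_schema(["URN", "LA (code)", "EstablishmentNumber"])
records = list(read_records(sample))
```

Because the records and the schema agree that LA (code) and EstablishmentNumber are strings, the target's schema validation has nothing to reject.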
Meltano 4.x requires an environment to be specified. Set production as
the default. Also remove the deprecated 'version: 2' field.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The meltanolabs target-postgres variant expects 'database' as the
config key, not 'dbname' (which was the pipelinewise variant's key).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `catalog` capability forced Meltano to run --discover and generate
a catalog file (tap.properties.json) before each extraction. This fails
because our Singer SDK taps emit schemas inline and don't need external
catalog files. Removing the capability makes Meltano invoke taps
directly without catalog generation.
Also switch from deprecated `meltano elt` to `meltano run` for
Meltano 4.x compatibility.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`meltano elt` requires catalog files (tap.properties.json) to exist.
These are generated by `meltano install` which discovers tap schemas
and installs the target-postgres loader. Without this step, `meltano
elt` fails with "catalog file is missing".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Port critical patterns from the working integrator into Singer taps:
- GIAS: add 404 fallback to yesterday's date, increase timeout to 300s,
use latin-1 encoding, use dated URL for links (static URL returns 500)
- FBIT: add GIAS date fallback, increase timeout, fix encoding to latin-1
- IDACI: use dated GIAS URL with fallback instead of undated static URL,
fix encoding to latin-1, increase timeout to 300s
- Ofsted: try utf-8-sig then fall back to latin-1 encoding
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
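The two download patterns ported above (404 fallback to yesterday's date, utf-8-sig then latin-1 decoding) can be sketched as follows. This is an illustration under assumptions, not the taps' actual code: `fetch` is a caller-supplied function returning (status_code, body), and the URL template and the YYYYMMDD date format are hypothetical.

```python
from datetime import date, timedelta


def decode_csv_bytes(payload):
    """Try utf-8-sig first, then fall back to latin-1 (which accepts any byte)."""
    try:
        return payload.decode("utf-8-sig")
    except UnicodeDecodeError:
        return payload.decode("latin-1")


def fetch_with_date_fallback(fetch, url_template, today):
    """Fetch today's dated file; on a 404, retry once with yesterday's date."""
    for day in (today, today - timedelta(days=1)):
        url = url_template.format(date=day.strftime("%Y%m%d"))
        status, body = fetch(url)
        if status != 404:
            break
    # If both attempts return 404, the last (yesterday's) result is returned.
    return url, status, body


def fake_fetch(url):
    """Stand-in for an HTTP client: today's file is missing, yesterday's exists."""
    if "20240102" in url:
        return 404, b""
    return 200, b"caf\xe9"


url, status, body = fetch_with_date_fallback(
    fake_fetch, "https://example.invalid/extract{date}.csv", date(2024, 1, 2)
)
text = decode_csv_bytes(body)
```

Trying utf-8-sig first also strips a leading byte-order mark when one is present, which matters for the first column name.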
- Add AIRFLOW__CORE__DAGS_FOLDER env var in Dockerfile so it's always set
- Run `airflow dags reserialize` after `db migrate` in init container so
DAGs appear immediately without waiting for scheduler scan interval
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix MELTANO_BIN/DBT_BIN paths: they pointed to .venv/bin/, but the Dockerfile installs both tools globally
- Add try/except for BashOperator import to handle both Airflow 3 provider
path and legacy path, preventing silent DAG import failures
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
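The import guard above is a plain try/except around two import statements. To keep the example runnable without Airflow installed, the sketch below expresses the same fallback as a small generic helper; `import_first` is a hypothetical name, and the two Airflow module paths appear only in the trailing comment.

```python
import importlib


def import_first(attr, module_paths):
    """Return `attr` from the first module in `module_paths` that provides it."""
    last_error = None
    for path in module_paths:
        try:
            return getattr(importlib.import_module(path), attr)
        except (ImportError, AttributeError) as exc:
            last_error = exc
    raise ImportError(f"{attr} not found in any of {module_paths}") from last_error


# Demonstration with stdlib modules: the first path does not exist, so the
# helper falls through to the second.
dumps = import_first("dumps", ["no_such_module_xyz", "json"])

# In a DAG file (requires Airflow to be installed):
# BashOperator = import_first("BashOperator", [
#     "airflow.providers.standard.operators.bash",  # Airflow 3 provider path
#     "airflow.operators.bash",                     # legacy path
# ])
```

Raising ImportError when every path fails keeps the failure visible in the DAG import errors instead of leaving the name undefined.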
Port extraction logic from integrator scripts into Singer SDK taps:
- tap-uk-parent-view: scrapes Ofsted open data portal, parses survey responses (14 questions)
- tap-uk-fbit: queries FBIT API per-URN with rate limiting, computes per-pupil spend
- tap-uk-idaci: downloads IoD2019 XLSX, batch-resolves postcodes→LSOAs via postcodes.io
Update dbt models to match actual tap output schemas:
- stg_idaci now includes URN (tap does the postcode→LSOA→school join)
- stg_parent_view expanded from 8 to 13 question columns
- fact_deprivation simplified (no longer needs postcode→LSOA join in dbt)
- fact_parent_view expanded to include all 13 question metrics
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
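A sketch of the batch postcode-to-LSOA resolution mentioned for tap-uk-idaci above. It assumes a caller-supplied `post_bulk(batch)` function that wraps the postcodes.io bulk lookup and returns a list of entries shaped like {"query": postcode, "result": {...} or None}; the batch size of 100 and the `codes.lsoa` field reflect that API as understood here and should be checked against its documentation. The function names are hypothetical.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def resolve_lsoas(postcodes, post_bulk):
    """Map postcode -> LSOA code via a caller-supplied bulk lookup function."""
    lsoa_by_postcode = {}
    unique = sorted(set(postcodes))  # avoid resolving the same postcode twice
    for batch in chunked(unique):
        for entry in post_bulk(batch):
            result = entry.get("result")
            if result:  # unmatched postcodes are assumed to have a null result
                lsoa_by_postcode[entry["query"]] = result["codes"]["lsoa"]
    return lsoa_by_postcode


# Fake lookup for illustration: records batch sizes and returns made-up codes.
batch_sizes = []


def fake_post_bulk(batch):
    batch_sizes.append(len(batch))
    return [{"query": p, "result": {"codes": {"lsoa": "E0100" + p[-4:]}}} for p in batch]


postcodes = [f"AB1 {i:04d}" for i in range(250)]
lookup = resolve_lsoas(postcodes, fake_post_bulk)
```

Injecting the lookup function keeps the batching logic testable without network access; the real tap would pass a function that performs the HTTP request.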
Airflow 3 replaced `airflow webserver` with `airflow api-server` and
removed the `airflow users` CLI. Auth is now via SimpleAuthManager
configured through AIRFLOW__CORE__SIMPLE_AUTH_MANAGER_USERS env var.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>