Commit Graph

23 Commits

Author SHA1 Message Date
Tudor Sitaru
2b757e556d fix(legacy-ks2): strip % suffix from percentage values
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 34s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m11s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m37s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
Old DfE CSVs encode percentages as "57%" not "57". The safe_numeric
macro rejects non-numeric strings, so strip the suffix before emitting.
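
The stripping step described above could look roughly like the following. This is a minimal sketch, not the tap's actual code: the helper name strip_percent_suffix is hypothetical, and it assumes values arrive as strings (or None) before they reach safe_numeric.

```python
from typing import Optional


def strip_percent_suffix(value: Optional[str]) -> Optional[str]:
    """Return the value without a trailing '%'; other input is only trimmed."""
    if value is None:
        return None
    cleaned = value.strip()
    if cleaned.endswith("%"):
        # Drop the suffix and any whitespace that preceded it.
        cleaned = cleaned[:-1].strip()
    return cleaned


print(strip_percent_suffix("57%"))
```

Values that are still non-numeric after stripping are left for safe_numeric to reject downstream.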

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-01 13:07:51 +01:00
Tudor Sitaru
fba8e74b72 refactor(legacy-ks2): use explicit year→URL mapping instead of base URL pattern
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 34s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m9s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 32s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 0s
The file hosting uses non-deterministic URLs, so replace legacy_ks2_base_url
+ legacy_ks2_years with a single legacy_ks2_urls object mapping year codes
to download URLs. Configure the 4 pre-COVID years in meltano.yml.
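
A rough illustration of how a stream might consume such a mapping. Only the setting name legacy_ks2_urls comes from the commit; the year codes, the example.invalid URLs, and the helper name are placeholders assumed for this sketch.

```python
from typing import Any, Dict, Iterator, Tuple

# Hypothetical config: year codes and URLs are placeholders, not real values.
config: Dict[str, Any] = {
    "legacy_ks2_urls": {
        "201617": "https://example.invalid/files/bbb/england_ks2final.csv",
        "201516": "https://example.invalid/files/aaa/england_ks2final.csv",
    }
}


def iter_legacy_ks2_downloads(cfg: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield (year_code, url) pairs in ascending year order."""
    urls = cfg.get("legacy_ks2_urls") or {}
    for year_code in sorted(urls):
        yield year_code, urls[year_code]


for year_code, url in iter_legacy_ks2_downloads(config):
    print(year_code, url)
```

Because each year carries its own URL, nothing in the code has to guess how the host names its files.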

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 22:44:11 +01:00
Tudor Sitaru
6d4962639c feat(legacy-ks2): add stream for pre-COVID KS2 data (2015-2019)
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 46s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m17s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 2m26s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
- Add LegacyKS2Stream to tap-uk-ees: downloads old DfE england_ks2final.csv
  files from a configurable base URL, maps 318-column wide format to the
  same schema as stg_ees_ks2 output
- Add stg_legacy_ks2.sql staging model with safe_numeric casts
- Add legacy_ks2 source to _stg_sources.yml
- Update int_ks2_with_lineage.sql to union EES + legacy data
- Configurable via legacy_ks2_base_url and legacy_ks2_years tap settings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-31 14:36:41 +01:00
Tudor Sitaru
fc011c6547 fix(tap-uk-ees): case-insensitive URN column matching for older census files
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 32s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m10s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m48s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 0s
Older census CSVs use 'URN' (uppercase) while the stream expects 'urn'.
Normalise the column name before filtering and emitting records.
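
One way to express that normalisation, shown on a plain list of header names. The function name is hypothetical and the sketch makes no claim about how the stream actually reads its CSVs.

```python
from typing import List


def normalise_urn_column(columns: List[str]) -> List[str]:
    """Rename any header that is 'urn' in any letter case to lowercase 'urn'."""
    return ["urn" if name.strip().lower() == "urn" else name for name in columns]


print(normalise_urn_column(["URN", "School Name"]))
```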

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 22:36:16 +01:00
Tudor Sitaru
752abd69a5 fix(tap-uk-ees): inject time_period from release slug when absent in CSV
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 32s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m8s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m37s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 0s
Older census (and other) files don't include a time_period column.
Derive it from the release slug (e.g. '2022-23' → '202223') and inject
it into records so the required Singer schema field is always present.
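
The derivation can be sketched as below. The function names are hypothetical, and the sketch assumes slugs of the form 'YYYY-YY'; any other slug shape is left unresolved rather than guessed.

```python
import re
from typing import Any, Dict, Optional

# Matches slugs such as '2022-23' (four-digit year, dash, two-digit year).
_SLUG_YEARS = re.compile(r"^(\d{4})-(\d{2})$")


def time_period_from_slug(slug: str) -> Optional[str]:
    """Turn '2022-23' into '202223'; return None for unrecognised slugs."""
    match = _SLUG_YEARS.match(slug)
    if match is None:
        return None
    return match.group(1) + match.group(2)


def inject_time_period(record: Dict[str, Any], slug: str) -> Dict[str, Any]:
    """Add time_period from the slug only when the record lacks one."""
    if record.get("time_period"):
        return record
    derived = time_period_from_slug(slug)
    if derived is None:
        return record
    return {**record, "time_period": derived}


print(inject_time_period({"urn": "100000"}, "2022-23"))
```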

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 21:59:24 +01:00
Tudor Sitaru
570c2b689e fix(tap-uk-ees): handle plain list response from releases endpoint
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 33s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m6s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m45s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 21:47:14 +01:00
Tudor Sitaru
9a1572ea20 feat(tap-uk-ees): fetch all historical releases, not just latest
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 32s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m9s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m42s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 0s
Add get_all_release_ids() to paginate /publications/{slug}/releases and
iterate over every release in get_records(). Add latest_only config flag
(default false) to restore single-release behaviour for daily runs.
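
A sketch of the pagination loop with a latest_only switch. Everything about the payload shape here is an assumption (the keys results, paging, totalPages and id are not taken from the EES API documentation), as is the newest-first ordering; the page fetcher is injected so the example runs without network access.

```python
from typing import Any, Callable, List


def collect_release_ids(
    fetch_page: Callable[[int], Any], latest_only: bool = False
) -> List[str]:
    """Walk pages until exhausted; stop after the first ID when latest_only."""
    release_ids: List[str] = []
    page = 1
    while True:
        payload = fetch_page(page)
        if isinstance(payload, list):
            # Assumed fallback: endpoint returned a bare, unpaginated list.
            items, total_pages = payload, 1
        else:
            items = payload.get("results", [])
            total_pages = payload.get("paging", {}).get("totalPages", 1)
        release_ids.extend(item["id"] for item in items)
        if latest_only and release_ids:
            # Assumes the endpoint lists the newest release first.
            return release_ids[:1]
        if page >= total_pages or not items:
            return release_ids
        page += 1


# Fake two-page response used only for illustration.
fake_pages = {
    1: {"results": [{"id": "r3"}, {"id": "r2"}], "paging": {"totalPages": 2}},
    2: {"results": [{"id": "r1"}], "paging": {"totalPages": 2}},
}
print(collect_release_ids(fake_pages.__getitem__))
```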

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-30 21:37:26 +01:00
250d1f7c77 fix(tap-uk-idaci): add openpyxl dependency for Excel file parsing
Some checks failed
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 49s
Build and Push Docker Images / Build Frontend (Next.js) (push) Failing after 1m2s
Build and Push Docker Images / Trigger Portainer Update (push) Has been cancelled
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Has been cancelled
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 15:00:00 +00:00
26aa3c2d70 fix(tap-uk-ofsted): fix header row detection matching 'urn' inside 'turn'
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 33s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m7s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m40s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
The preamble row in Ofsted CSVs contains 'turn off all filters' which
matched 'urn' in line.lower(), so header_idx was set to 0 instead of
the real header row. Use a regex that matches URN only as a CSV field.
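
The difference between the substring check and a field-level match can be sketched as follows. The pattern and the sample lines are illustrative assumptions, not the regex or data from the tap; the pattern accepts URN only when it is a whole comma-delimited field, optionally quoted.

```python
import re
from typing import List, Optional

# 'urn' must be a whole CSV field: preceded by start-of-line or a comma,
# followed by a comma or end-of-line.
_URN_FIELD = re.compile(r'(?:^|,)\s*"?urn"?\s*(?:,|$)', re.IGNORECASE)


def find_header_index(lines: List[str]) -> Optional[int]:
    """Return the index of the first line with URN as a standalone field."""
    for idx, line in enumerate(lines):
        if _URN_FIELD.search(line.strip()):
            return idx
    return None


# Made-up sample: a preamble row followed by the real header row.
sample = [
    "To see every row please turn off all filters,,",
    "Web link,URN,School name",
    "https://example.invalid/1,100000,Example Primary",
]
print(find_header_index(sample))
```

A plain `"urn" in line.lower()` check would have returned 0 for this sample, which is the bug the commit describes.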

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 17:05:03 +00:00
e56a63c59c debug(tap-uk-ofsted): log CSV column names to diagnose 0-record extraction
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 31s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m4s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m40s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 15:47:32 +00:00
668e234eb2 feat(census): add demographic columns to EES census tap and staging models
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 32s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m7s
Build and Push Docker Images / Build Integrator (push) Successful in 55s
Build and Push Docker Images / Build Kestra Init (push) Successful in 32s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m39s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
tap-uk-ees: EESCensusStream now declares 27 data columns (FSM %, EAL %,
ethnicity breakdowns, pupil counts) with clean Singer field names mapped
from the verbose CSV column names (e.g. '% of pupils known to be eligible
for free school meals' → fsm_pct) via a new _column_renames mechanism on
the base stream class.
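
A minimal sketch of how a class-level rename map could work. The class and method names are hypothetical stand-ins (the real streams build on the Singer SDK, which is left out here so the example stays self-contained); only the _column_renames attribute name and the FSM mapping come from the commit message.

```python
from typing import Any, Dict


class SketchBaseStream:
    # Subclasses override this with {verbose CSV name: clean field name}.
    _column_renames: Dict[str, str] = {}

    def rename_columns(self, row: Dict[str, Any]) -> Dict[str, Any]:
        """Apply the rename map; columns without an entry keep their name."""
        return {self._column_renames.get(key, key): value for key, value in row.items()}


class SketchCensusStream(SketchBaseStream):
    _column_renames = {
        "% of pupils known to be eligible for free school meals": "fsm_pct",
    }


row = {"urn": "100000", "% of pupils known to be eligible for free school meals": "21.4"}
print(SketchCensusStream().rename_columns(row))
```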

stg_ees_census: materialised as table, applies safe_numeric to all
percentage/count columns, filters to numeric URNs.

int_pupil_chars_merged + fact_pupil_characteristics: pass all columns
through from staging (previously stubs with only 3 columns).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 14:07:48 +00:00
8e8d1bd8c5 fix(ees-tap): filter out rows with null URN before emitting
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 32s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m10s
Build and Push Docker Images / Build Integrator (push) Successful in 56s
Build and Push Docker Images / Build Kestra Init (push) Successful in 32s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m47s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
The admissions school-level file contains some rows with null school_urn
(LA/category aggregates that survive the geographic_level filter). These
cause a not-null constraint violation at target-postgres. Drop any row
where the URN column is null or empty before yielding records.
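
The filter might be written along these lines. The helper names are hypothetical, and treating float NaN as missing is an added assumption for rows that come out of a DataFrame.

```python
from typing import Any, Dict, Iterable, Iterator


def has_urn(record: Dict[str, Any], urn_column: str = "school_urn") -> bool:
    """True when the URN column holds a non-empty, non-null value."""
    value = record.get(urn_column)
    if value is None:
        return False
    if isinstance(value, float) and value != value:  # NaN never equals itself
        return False
    return str(value).strip() != ""


def drop_rows_without_urn(
    records: Iterable[Dict[str, Any]], urn_column: str = "school_urn"
) -> Iterator[Dict[str, Any]]:
    """Yield only the records that carry a usable URN."""
    for record in records:
        if has_urn(record, urn_column):
            yield record


rows = [
    {"school_urn": "100000"},
    {"school_urn": ""},
    {"school_urn": None},
    {"school_urn": float("nan")},
]
print(list(drop_rows_without_urn(rows)))
```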

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 10:13:17 +00:00
c7357336e3 fix(ees-tap): fix BOM handling for admissions CSV
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 33s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m6s
Build and Push Docker Images / Build Integrator (push) Successful in 57s
Build and Push Docker Images / Build Kestra Init (push) Successful in 32s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m40s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
Admissions file is UTF-8 with BOM, not Latin-1. Reading as latin-1
decoded the BOM bytes as 'ï»¿' which wasn't stripped. Change admissions
encoding to utf-8-sig (strips BOM automatically). Also update the manual
BOM strip fallback to handle the latin-1 decoded form.
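
The effect of the two encodings on a BOM-prefixed file can be reproduced with the standard library alone. The strip_bom helper below is a hypothetical stand-in for the manual fallback, and the sample bytes are made up.

```python
import codecs

# Made-up file content: UTF-8 BOM followed by a header and one row.
raw = codecs.BOM_UTF8 + "time_period,school_urn\n202223,100000\n".encode("utf-8")

as_latin1 = raw.decode("latin-1")      # BOM bytes survive as three characters
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is consumed by the codec

# The three BOM bytes EF BB BF read as Latin-1 give 'ï»¿'.
LATIN1_DECODED_BOM = codecs.BOM_UTF8.decode("latin-1")


def strip_bom(name: str) -> str:
    """Remove a leading BOM in either its real or its Latin-1 decoded form."""
    for prefix in ("\ufeff", LATIN1_DECODED_BOM):
        if name.startswith(prefix):
            return name[len(prefix):]
    return name


print(repr(as_latin1.split(",")[0]))
print(repr(as_utf8_sig.split(",")[0]))
print(strip_bom(as_latin1.split(",")[0]))
```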

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 10:03:17 +00:00
b8ecc5c58b fix(ees-tap): strip UTF-8 BOM from CSV column names
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 32s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m12s
Build and Push Docker Images / Build Integrator (push) Successful in 55s
Build and Push Docker Images / Build Kestra Init (push) Successful in 31s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m42s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 0s
Some DfE supporting-files CSVs have a UTF-8 BOM on the first column,
causing it to be named '\ufefftime_period' instead of 'time_period'.
This trips Singer schema validation ('time_period' is a required property).
Strip the BOM from all column names after read_csv.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 09:54:15 +00:00
f4f0257447 fix(ees-tap): add latin-1 encoding for census/admissions, default utf-8 for others
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 52s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m8s
Build and Push Docker Images / Build Integrator (push) Successful in 55s
Build and Push Docker Images / Build Kestra Init (push) Successful in 31s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m40s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 0s
DfE supporting-files CSVs (spc_school_level_underlying_data, AppsandOffers
SchoolLevel) are Latin-1 encoded. Add _encoding class attribute to base
stream class and override to 'latin-1' for census and admissions streams.
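
The class-attribute override pattern, reduced to plain Python. The class names and the decode method are hypothetical; only the _encoding attribute name and the 'latin-1' override are taken from the commit message.

```python
class SketchBaseStream:
    # Default for streams whose files are UTF-8.
    _encoding = "utf-8"

    def decode(self, raw: bytes) -> str:
        """Decode downloaded bytes using the stream's declared encoding."""
        return raw.decode(self._encoding)


class SketchCensusStream(SketchBaseStream):
    # Census files are Latin-1, so only the attribute changes.
    _encoding = "latin-1"


latin1_bytes = "Café Primary".encode("latin-1")
print(SketchCensusStream().decode(latin1_bytes))
```

Decoding the same bytes with the base class would raise UnicodeDecodeError, which is the failure mode the override avoids.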

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 09:41:40 +00:00
ca351e9d73 feat: migrate backend to marts schema, update EES tap for verified datasets
Pipeline:
- EES tap: split KS4 into performance + info streams, fix admissions filename
  (SchoolLevel keyword match), fix census filename (yearly suffix), remove
  phonics (no school-level data on EES), change endswith → in for matching
- stg_ees_ks4: rewrite to filter long-format data and extract Attainment 8,
  Progress 8, EBacc, English/Maths metrics; join KS4 info for context
- stg_ees_admissions: map real CSV columns (total_number_places_offered, etc.)
- stg_ees_census: update source reference, stub with TODO for data columns
- Remove stg_ees_phonics, fact_phonics (no school-level EES data)
- Add ees_ks4_performance + ees_ks4_info sources, remove ees_ks4 + ees_phonics
- Update int_ks4_with_lineage + fact_ks4_performance with new KS4 columns
- Annual EES DAG: remove stg_ees_phonics+ from selector

Backend:
- models.py: replace all models to point at marts.* tables with schema='marts'
  (DimSchool, DimLocation, KS2Performance, FactOfstedInspection, etc.)
- data_loader.py: rewrite load_school_data_as_dataframe() using raw SQL joining
  dim_school + dim_location + fact_ks2_performance; update get_supplementary_data()
- database.py: remove migration machinery, keep only connection setup
- app.py: remove check_and_migrate_if_needed, remove /api/admin/reimport-ks2
  endpoints (pipeline handles all imports)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 09:29:27 +00:00
d82e36e7b2 feat(ees): rewrite EES tap and KS2 models for actual data structure
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 31s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m8s
Build and Push Docker Images / Build Integrator (push) Successful in 55s
Build and Push Docker Images / Build Kestra Init (push) Successful in 32s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m45s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
- Fix publication slugs (KS4, Phonics, Admissions were wrong)
- Split KS2 into two streams: ees_ks2_attainment (long format) and
  ees_ks2_info (wide format context data)
- Target specific filenames instead of keyword matching
- Handle school_urn vs urn column naming
- Pivot KS2 attainment from long to wide format in dbt staging
- Add all ~40 KS2 columns the backend needs (GPS, absence, gender,
  disadvantaged breakdowns, context demographics)
- Pass through all columns in int_ks2_with_lineage and fact_ks2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-26 23:08:50 +00:00
e7b1ab9f37 fix(pipeline): expand GIAS schema, handle empty strings, scope DAG selectors
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 32s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m8s
Build and Push Docker Images / Build Integrator (push) Successful in 57s
Build and Push Docker Images / Build Kestra Init (push) Successful in 34s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m39s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
- Declare all 34 columns needed by dbt in GIAS tap schema (target-postgres
  only persists columns present in the Singer schema message)
- Use nullif() for empty-string-to-integer/date casts in staging models
- Scope daily DAG dbt build to GIAS models only (stg_gias_establishments+
  stg_gias_links+) to avoid errors on unloaded sources
- Scope annual EES DAG similarly; remove redundant dbt test steps
- Make dim_school gracefully handle missing int_ofsted_latest table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-26 20:43:24 +00:00
0062a5eabe fix(tap-gias): declare numeric CSV columns as StringType
Some checks failed
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 35s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m7s
Build and Push Docker Images / Build Integrator (push) Failing after 30s
Build and Push Docker Images / Build Kestra Init (push) Failing after 30s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Failing after 29s
Build and Push Docker Images / Trigger Portainer Update (push) Has been skipped
CSV is read with dtype=str so all values arrive as strings. Declaring
LA (code) and EstablishmentNumber as IntegerType caused schema
validation failures in target-postgres. Use StringType for all columns
except URN (which is explicitly cast to int for the primary key).
Type casting happens in dbt staging models.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-26 14:03:26 +00:00
cd75fc4c24 fix(taps): align with integrator resilience patterns
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 32s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m5s
Build and Push Docker Images / Build Integrator (push) Successful in 56s
Build and Push Docker Images / Build Kestra Init (push) Successful in 32s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m7s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
Port critical patterns from the working integrator into Singer taps:
- GIAS: add 404 fallback to yesterday's date, increase timeout to 300s,
  use latin-1 encoding, use dated URL for links (static URL returns 500)
- FBIT: add GIAS date fallback, increase timeout, fix encoding to latin-1
- IDACI: use dated GIAS URL with fallback instead of undated static URL,
  fix encoding to latin-1, increase timeout to 300s
- Ofsted: try utf-8-sig then fall back to latin-1 encoding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-26 11:13:38 +00:00
97d975114a feat(pipeline): implement parent-view, fbit, idaci Singer taps + align staging/mart models
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 34s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m5s
Build and Push Docker Images / Build Integrator (push) Successful in 57s
Build and Push Docker Images / Build Kestra Init (push) Successful in 31s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m6s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
Port extraction logic from integrator scripts into Singer SDK taps:
- tap-uk-parent-view: scrapes Ofsted open data portal, parses survey responses (14 questions)
- tap-uk-fbit: queries FBIT API per-URN with rate limiting, computes per-pupil spend
- tap-uk-idaci: downloads IoD2019 XLSX, batch-resolves postcodes→LSOAs via postcodes.io
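
The batch resolution mentioned for tap-uk-idaci might be structured like this. It is a sketch under stated assumptions: the helper names are hypothetical, the batch size of 100 is assumed rather than taken from the tap, and the lookup call is injected instead of calling postcodes.io directly.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def batched(postcodes: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Split postcodes into lists of at most batch_size items."""
    batch: List[str] = []
    for postcode in postcodes:
        batch.append(postcode)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def resolve_lsoas(
    postcodes: Iterable[str],
    lookup_batch: Callable[[List[str]], Dict[str, Optional[str]]],
    batch_size: int = 100,
) -> Dict[str, Optional[str]]:
    """Resolve postcodes to LSOA codes with one lookup call per batch."""
    resolved: Dict[str, Optional[str]] = {}
    for batch in batched(postcodes, batch_size):
        resolved.update(lookup_batch(batch))
    return resolved


def fake_lookup(batch: List[str]) -> Dict[str, Optional[str]]:
    # Stand-in for the real bulk request; returns placeholder codes.
    return {postcode: f"LSOA-{postcode}" for postcode in batch}


print(resolve_lsoas(["AA1 1AA", "BB2 2BB", "CC3 3CC"], fake_lookup, batch_size=2))
```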

Update dbt models to match actual tap output schemas:
- stg_idaci now includes URN (tap does the postcode→LSOA→school join)
- stg_parent_view expanded from 8 to 13 question columns
- fact_deprivation simplified (no longer needs postcode→LSOA join in dbt)
- fact_parent_view expanded to include all 13 question metrics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-26 10:38:07 +00:00
deb4024731 chore(pipeline): bump all dependencies to latest stable versions
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 32s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m4s
Build and Push Docker Images / Build Integrator (push) Successful in 57s
Build and Push Docker Images / Build Kestra Init (push) Successful in 32s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m45s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 0s
- Airflow 2.11 → 3.1 (BashOperator moved to providers-standard)
- Meltano 3.5 → 4.1 (meltano.yml version 2, meltanolabs target-postgres)
- dbt-postgres 1.9 → 1.10
- singer-sdk 0.39 → 0.53 (all 6 taps)
- Typesense Docker 27.1 → 30.1
- Typesense Python client >=2.0
- Python base image 3.12 → 3.13

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-26 09:18:11 +00:00
8f02b5125e feat(pipeline): add Meltano + dbt + Airflow ELT pipeline scaffold
All checks were successful
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 35s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 1m9s
Build and Push Docker Images / Build Integrator (push) Successful in 56s
Build and Push Docker Images / Build Kestra Init (push) Successful in 32s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
Replaces the hand-rolled integrator with a production-grade ELT pipeline
using Meltano (Singer taps), dbt Core (medallion architecture), and
Apache Airflow (orchestration). Adds Typesense for search and PostGIS
for geospatial queries.

- 6 custom Singer taps (GIAS, EES, Ofsted, Parent View, FBIT, IDACI)
- dbt project: 12 staging, 5 intermediate, 12 mart models
- 3 Airflow DAGs (daily/monthly/annual schedules)
- Typesense sync + batch geocoding scripts
- docker-compose: add Airflow, Typesense; upgrade to PostGIS
- Portainer stack definition matching live deployment topology

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-26 08:37:53 +00:00