refactor(pipeline): unify KS2 and KS4 legacy sources to same annual ZIPs

LegacyKS2Stream now auto-detects whether a download is a ZIP or a bare CSV:
if it is a ZIP, it extracts england_ks2final.csv; if it is a plain CSV, it
reads the content directly. This keeps backwards compatibility while allowing
both streams to share the same DfE annual archive URLs.
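
The detection described above can be sketched in isolation. This is a minimal illustration, not the tap's actual code: the helper name `extract_ks2_csv` is hypothetical, and matching the member name case-insensitively on both the stem and the extension is an assumption that goes slightly beyond what the commit does.

```python
import io
import zipfile
from typing import Optional


def extract_ks2_csv(content: bytes) -> Optional[bytes]:
    """Return KS2 CSV bytes from a payload that is either a ZIP or a bare CSV.

    Hypothetical helper for illustration only. Returns None when the payload
    is a ZIP that contains no ks2final CSV member.
    """
    try:
        # ZipFile raises BadZipFile when the payload is not a ZIP archive.
        with zipfile.ZipFile(io.BytesIO(content)) as zf:
            target = next(
                (
                    name
                    for name in zf.namelist()
                    if "ks2final" in name.lower() and name.lower().endswith(".csv")
                ),
                None,
            )
            if target is None:
                return None
            return zf.read(target)
    except zipfile.BadZipFile:
        # Not a ZIP: assume the payload is already the CSV itself.
        return content
```

Using the archive as a context manager closes it once the member has been read, and returning None for a ZIP with no KS2 member lets the caller log a warning and skip that year.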

legacy_ks2_urls now points at the same 4 ZIPs as legacy_ks4_urls, so only
one set of archives needs to be maintained going forward.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tudor Sitaru
2026-04-16 10:41:01 +01:00
parent 785cb72063
commit ae33bfe04b
2 changed files with 28 additions and 5 deletions
+4 -4
@@ -31,10 +31,10 @@ plugins:
description: "Year code → URL mapping for legacy KS4 ZIPs (england_ks4final.csv inside)"
config:
legacy_ks2_urls:
-"201516": "http://10.0.1.224:8081/filebrowser/api/public/dl/R9jjXFWa?inline=true"
-"201617": "http://10.0.1.224:8081/filebrowser/api/public/dl/tIwJPVQS?inline=true"
-"201718": "http://10.0.1.224:8081/filebrowser/api/public/dl/GO7SKE0p?inline=true"
-"201819": "http://10.0.1.224:8081/filebrowser/api/public/dl/jchDEHsv?inline=true"
+"201516": "http://10.0.1.224:8081/filebrowser/api/public/dl/iaoSkg1v?inline=true"
+"201617": "http://10.0.1.224:8081/filebrowser/api/public/dl/bqCMUcIH?inline=true"
+"201718": "http://10.0.1.224:8081/filebrowser/api/public/dl/0L61fE_a?inline=true"
+"201819": "http://10.0.1.224:8081/filebrowser/api/public/dl/XJGJ5lG1?inline=true"
legacy_ks4_urls:
"201516": "http://10.0.1.224:8081/filebrowser/api/public/dl/iaoSkg1v?inline=true"
"201617": "http://10.0.1.224:8081/filebrowser/api/public/dl/bqCMUcIH?inline=true"
@@ -682,8 +682,31 @@ class LegacyKS2Stream(Stream):
     self.logger.warning("Failed to download %s: %s", url, e)
     continue
 content = resp.content
+# Auto-detect ZIP: the DfE annual archives contain both KS2 and KS4
+# CSVs in one ZIP. If the download is a ZIP, extract england_ks2final.csv;
+# otherwise treat the content as a bare CSV (legacy individual-file URLs).
+csv_bytes = None
+try:
+    zf = zipfile.ZipFile(io.BytesIO(content))
+    target = next(
+        (n for n in zf.namelist() if "ks2final" in n.lower() and n.endswith(".csv")),
+        None,
+    )
+    if target:
+        with zf.open(target) as f:
+            csv_bytes = f.read()
+        self.logger.info("Extracted %s from ZIP for %s", target, year_code)
+    else:
+        self.logger.warning("No ks2final CSV found in ZIP for %s", year_code)
+        continue
+except zipfile.BadZipFile:
+    # Not a ZIP; treat as a bare CSV file
+    csv_bytes = content
 df = pd.read_csv(
-    io.BytesIO(resp.content),
+    io.BytesIO(csv_bytes),
     dtype=str,
     keep_default_na=False,
     encoding="latin-1",