refactor(pipeline): unify KS2 and KS4 legacy sources to same annual ZIPs
Build and Push Docker Images / Build Backend (FastAPI) (push) Successful in 13s
Build and Push Docker Images / Build Frontend (Next.js) (push) Successful in 47s
Build and Push Docker Images / Build Pipeline (Meltano + dbt + Airflow) (push) Successful in 1m18s
Build and Push Docker Images / Trigger Portainer Update (push) Successful in 1s
LegacyKS2Stream now auto-detects ZIP vs bare CSV: if the download is a ZIP it extracts england_ks2final.csv; if it is a plain CSV file it reads the content directly. This keeps backwards compatibility while allowing both streams to share the same DfE annual archive URLs.

legacy_ks2_urls now points at the same 4 ZIPs as legacy_ks4_urls, so only one set of archives needs to be maintained going forward.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -31,10 +31,10 @@ plugins:
       description: "Year code → URL mapping for legacy KS4 ZIPs (england_ks4final.csv inside)"
     config:
       legacy_ks2_urls:
-        "201516": "http://10.0.1.224:8081/filebrowser/api/public/dl/R9jjXFWa?inline=true"
-        "201617": "http://10.0.1.224:8081/filebrowser/api/public/dl/tIwJPVQS?inline=true"
-        "201718": "http://10.0.1.224:8081/filebrowser/api/public/dl/GO7SKE0p?inline=true"
-        "201819": "http://10.0.1.224:8081/filebrowser/api/public/dl/jchDEHsv?inline=true"
+        "201516": "http://10.0.1.224:8081/filebrowser/api/public/dl/iaoSkg1v?inline=true"
+        "201617": "http://10.0.1.224:8081/filebrowser/api/public/dl/bqCMUcIH?inline=true"
+        "201718": "http://10.0.1.224:8081/filebrowser/api/public/dl/0L61fE_a?inline=true"
+        "201819": "http://10.0.1.224:8081/filebrowser/api/public/dl/XJGJ5lG1?inline=true"
       legacy_ks4_urls:
         "201516": "http://10.0.1.224:8081/filebrowser/api/public/dl/iaoSkg1v?inline=true"
         "201617": "http://10.0.1.224:8081/filebrowser/api/public/dl/bqCMUcIH?inline=true"
@@ -682,8 +682,31 @@ class LegacyKS2Stream(Stream):
                 self.logger.warning("Failed to download %s: %s", url, e)
                 continue
 
+            content = resp.content
+
+            # Auto-detect ZIP — the DfE annual archives contain both KS2 and KS4
+            # CSVs in one ZIP. If the download is a ZIP, extract england_ks2final.csv;
+            # otherwise treat the content as a bare CSV (legacy individual-file URLs).
+            csv_bytes = None
+            try:
+                zf = zipfile.ZipFile(io.BytesIO(content))
+                target = next(
+                    (n for n in zf.namelist() if "ks2final" in n.lower() and n.endswith(".csv")),
+                    None,
+                )
+                if target:
+                    with zf.open(target) as f:
+                        csv_bytes = f.read()
+                    self.logger.info("Extracted %s from ZIP for %s", target, year_code)
+                else:
+                    self.logger.warning("No ks2final CSV found in ZIP for %s", year_code)
+                    continue
+            except zipfile.BadZipFile:
+                # Not a ZIP — treat as a bare CSV file
+                csv_bytes = content
+
             df = pd.read_csv(
-                io.BytesIO(resp.content),
+                io.BytesIO(csv_bytes),
                 dtype=str,
                 keep_default_na=False,
                 encoding="latin-1",
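The ZIP-versus-CSV detection added in the hunk above can be tried in isolation. The following is a minimal sketch, not code from this commit: `extract_ks2_csv` and `make_zip` are hypothetical helper names, and the sample payloads are made up. It follows the same approach as the diff (attempt to open the bytes as a ZIP, fall back to treating them as a bare CSV on `zipfile.BadZipFile`), with one labeled difference: the `.csv` suffix check is lower-cased here.

```python
import io
import zipfile
from typing import Dict, Optional


def extract_ks2_csv(content: bytes) -> Optional[bytes]:
    """Return KS2 CSV bytes from either a ZIP archive or a bare CSV payload.

    Hypothetical helper for illustration; it is not part of the commit.
    Returns None when the payload is a ZIP that holds no ks2final CSV.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(content)) as zf:
            # Assumption: the suffix check is lower-cased here, so ".CSV"
            # also matches. The commit's check is case-sensitive on ".csv".
            target = next(
                (
                    name
                    for name in zf.namelist()
                    if "ks2final" in name.lower() and name.lower().endswith(".csv")
                ),
                None,
            )
            if target is None:
                return None
            return zf.read(target)
    except zipfile.BadZipFile:
        # Not a ZIP: treat the payload as a bare CSV file.
        return content


def make_zip(members: Dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP from a name -> bytes mapping (test fixture only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()
```

Returning `None` for "ZIP without a KS2 file" lets the caller log a warning and skip the year, which is what the `continue` branch in the diff does, while a bare CSV passes through unchanged.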