fix(taps): align with integrator resilience patterns

Port critical patterns from the working integrator into Singer taps:
- GIAS: add 404 fallback to yesterday's date, increase timeout to 300s,
  use latin-1 encoding, use dated URL for links (static URL returns 500)
- FBIT: add GIAS date fallback, increase timeout, fix encoding to latin-1
- IDACI: use dated GIAS URL with fallback instead of undated static URL,
  fix encoding to latin-1, increase timeout to 300s
- Ofsted: try utf-8-sig then fall back to latin-1 encoding
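
The date fallback listed above could be factored into one helper shared by the taps. This is only a sketch, not code from this commit: `fetch_with_date_fallback`, `FakeResponse`, `fake_get` and the `example.invalid` URL are hypothetical names, and the HTTP call is injected as a callable so the example runs without network access.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Iterable


@dataclass
class FakeResponse:
    """Hypothetical stand-in for requests.Response; only status_code is used."""

    status_code: int
    url: str


def fetch_with_date_fallback(
    template: str,
    today: date,
    get: Callable[[str], FakeResponse],
    fallback_statuses: Iterable[int] = (404,),
    days_back: int = 1,
) -> FakeResponse:
    """Request today's dated URL, stepping back one day per retry.

    Returns the first response whose status is not in fallback_statuses,
    or the last response tried if every attempt hit a fallback status.
    """
    statuses = set(fallback_statuses)
    resp = None
    for offset in range(days_back + 1):
        stamp = (today - timedelta(days=offset)).strftime("%Y%m%d")
        resp = get(template.format(date=stamp))
        if resp.status_code not in statuses:
            break
    return resp


def fake_get(url: str) -> FakeResponse:
    # Pretend only the 2026-03-25 file has been published.
    return FakeResponse(200 if "20260325" in url else 404, url)


result = fetch_with_date_fallback(
    "https://example.invalid/data{date}.csv", date(2026, 3, 26), fake_get
)
```

In a tap, `get` might be `lambda u: requests.get(u, timeout=300)`, with `fallback_statuses=(404, 500)` for the links file; the caller would still call `raise_for_status()` on the result.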

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-26 11:13:38 +00:00
parent b6a487776b
commit cd75fc4c24
4 changed files with 66 additions and 21 deletions


@@ -2,7 +2,7 @@
 from __future__ import annotations
-from datetime import date
+from datetime import date, timedelta
 from singer_sdk import Stream, Tap
 from singer_sdk import typing as th
@@ -12,6 +12,11 @@ GIAS_URL_TEMPLATE = (
     "/edubase/downloads/public/edubasealldata{date}.csv"
 )
+GIAS_LINKS_URL_TEMPLATE = (
+    "https://ea-edubase-api-prod.azurewebsites.net"
+    "/edubase/downloads/public/links_edubasealldata{date}.csv"
+)
 class GIASEstablishmentsStream(Stream):
     """Stream: GIAS establishments (one row per URN)."""
@@ -40,23 +45,30 @@ class GIASEstablishmentsStream(Stream):
         import pandas as pd
         import requests
-        today = date.today().strftime("%Y%m%d")
-        url = GIAS_URL_TEMPLATE.format(date=today)
+        today = date.today()
+        url = GIAS_URL_TEMPLATE.format(date=today.strftime("%Y%m%d"))
         self.logger.info("Downloading GIAS bulk CSV: %s", url)
-        resp = requests.get(url, timeout=120)
+        resp = requests.get(url, timeout=300)
+        # GIAS may not have today's file yet — fall back to yesterday
+        if resp.status_code == 404:
+            yesterday = (today - timedelta(days=1)).strftime("%Y%m%d")
+            url = GIAS_URL_TEMPLATE.format(date=yesterday)
+            self.logger.info("Today's file not available, trying yesterday: %s", url)
+            resp = requests.get(url, timeout=300)
         resp.raise_for_status()
         df = pd.read_csv(
             io.StringIO(resp.text),
-            encoding="utf-8-sig",
+            encoding="latin-1",
             dtype=str,
             keep_default_na=False,
         )
         for _, row in df.iterrows():
             record = row.to_dict()
             # Cast URN to int
             try:
                 record["URN"] = int(record["URN"])
             except (ValueError, KeyError):
@@ -85,18 +97,24 @@ class GIASLinksStream(Stream):
         import pandas as pd
         import requests
-        url = (
-            "https://ea-edubase-api-prod.azurewebsites.net"
-            "/edubase/downloads/public/links_edubasealldata.csv"
-        )
+        today = date.today()
+        url = GIAS_LINKS_URL_TEMPLATE.format(date=today.strftime("%Y%m%d"))
         self.logger.info("Downloading GIAS links CSV: %s", url)
-        resp = requests.get(url, timeout=120)
+        resp = requests.get(url, timeout=300)
+        # Fall back to yesterday if today's file isn't available
+        if resp.status_code in (404, 500):
+            yesterday = (today - timedelta(days=1)).strftime("%Y%m%d")
+            url = GIAS_LINKS_URL_TEMPLATE.format(date=yesterday)
+            self.logger.info("Today's links file not available, trying yesterday: %s", url)
+            resp = requests.get(url, timeout=300)
         resp.raise_for_status()
         df = pd.read_csv(
             io.StringIO(resp.text),
-            encoding="utf-8-sig",
+            encoding="latin-1",
             dtype=str,
             keep_default_na=False,
         )
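
Review note: `resp.text` is already a decoded `str`, so the `encoding=` argument passed to `pd.read_csv` alongside `io.StringIO` most likely has no effect; the decoding is chosen by `requests` from the response headers. If the goal is to control the encoding, one option is to decode `resp.content` (the raw bytes) explicitly. The sketch below is an assumption about how that could look, not code from this commit: `decode_csv_bytes` is a hypothetical helper, the sample payloads are invented, and it uses the standard-library `csv` module only to stay dependency-free.

```python
from __future__ import annotations

import csv
import io


def decode_csv_bytes(
    raw: bytes, encodings: tuple[str, ...] = ("utf-8-sig", "latin-1")
) -> str:
    """Decode raw with the first encoding in the list that succeeds.

    latin-1 maps every byte to a character and so never fails;
    it has to come last or later encodings are never tried.
    """
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode the payload")


# Invented sample payloads: the same CSV in two encodings.
text = "URN,Name\n100000,Café School\n"
utf8_payload = text.encode("utf-8-sig")  # includes a BOM
latin1_payload = text.encode("latin-1")

utf8_rows = list(csv.DictReader(io.StringIO(decode_csv_bytes(utf8_payload))))
latin1_rows = list(csv.DictReader(io.StringIO(decode_csv_bytes(latin1_payload))))
```

In the taps, the decoded string could be wrapped in `io.StringIO` and passed to `pd.read_csv` as before, with the `encoding=` argument dropped. The same helper would cover the Ofsted case of trying utf-8-sig before latin-1.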