perf: resolve all P1–P5 performance issues from code review

P1 (backend/data_loader.py): Add load_latest_school_data() which pre-computes
the one-row-per-school latest-year snapshot (groupby, prev-year trend merge)
once at startup instead of on every /api/schools request. get_schools route
now starts from the cached snapshot rather than rebuilding it.
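
As a rough illustration of what the snapshot contains, the toy example below reduces a small table to one row per school at its latest year and attaches the previous year's value for the trend column. The data is invented, and the second-latest row is picked with cumcount() rather than the nth(1) call used in the commit, so treat it as a sketch of the idea rather than the exact implementation.

```python
import pandas as pd

# Hypothetical toy data: school 100 has two years of results, school 200 has one.
df = pd.DataFrame(
    {
        "urn": [100, 100, 200],
        "year": [2023, 2024, 2024],
        "rwm_expected_pct": [60.0, 65.0, 70.0],
    }
)

# Latest year per school, then keep only the rows for that year.
latest_year = df.groupby("urn")["year"].max().reset_index()
df_latest = df.merge(latest_year, on=["urn", "year"])

# Second-latest row per school: sort newest first and take position 1 in each group.
df_sorted = df.sort_values(["urn", "year"], ascending=[True, False])
df_prev = df_sorted[df_sorted.groupby("urn").cumcount() == 1]
prev = df_prev[["urn", "rwm_expected_pct"]].rename(
    columns={"rwm_expected_pct": "prev_rwm_expected_pct"}
)

# Left merge keeps schools that have no previous year (their trend value is NaN).
snapshot = df_latest.merge(prev, on="urn", how="left")
print(snapshot)
```

School 200 has only one year of data, so its prev_rwm_expected_pct is missing and no trend arrow can be derived for it.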

S3 (backend/app.py): Wrap synchronous geocode_single_postcode() call in
asyncio.to_thread() so postcode lookups no longer block the uvicorn event
loop. Admin reload endpoint also uses to_thread for both cache primes.
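
A minimal sketch of the to_thread pattern, assuming Python 3.9 or later. geocode_blocking() is a hypothetical stand-in for geocode_single_postcode(), and the heartbeat coroutine exists only to show that other work on the event loop keeps running while the blocking call sits in a worker thread.

```python
import asyncio
import time


def geocode_blocking(postcode: str) -> tuple:
    """Hypothetical stand-in for a synchronous geocoding call."""
    time.sleep(0.2)  # simulates a slow, blocking network request
    return (51.5, -0.12)


async def heartbeat(ticks: list) -> None:
    """Other work on the event loop; it only progresses if the loop is free."""
    for _ in range(3):
        await asyncio.sleep(0.05)
        ticks.append(time.monotonic())


async def main():
    ticks = []
    # to_thread runs the blocking function in a worker thread and returns an
    # awaitable, so the heartbeat coroutine is scheduled concurrently.
    coords, _ = await asyncio.gather(
        asyncio.to_thread(geocode_blocking, "SW1A 1AA"),
        heartbeat(ticks),
    )
    return coords, ticks


coords, ticks = asyncio.run(main())
print(coords, len(ticks))
```

Calling geocode_blocking() directly inside main() would instead hold the event loop for the full duration of the lookup.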

P2 (nextjs-app/components/HomeView.tsx): Add mapParamsRef guard so switching
back to map view does not re-fetch 500 schools when search params haven't
changed. Reset ref on new searches so fresh data is always fetched.

P3 (nextjs-app/lib/chartSetup.ts): Extract Chart.js registration into a
shared side-effect module. ComparisonChart and PerformanceChart now import
it instead of each calling ChartJS.register() independently.

P4 (backend/database.py): Remove unnecessary db.commit() from the read-only
get_db_session() context manager — saves a DB round-trip on every request.

P5 (backend/database.py): Add pool_recycle=1800 to SQLAlchemy engine to
prevent stale TCP connections from accumulating in long-running processes.
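
The two database.py changes could look roughly like the sketch below. It assumes SQLAlchemy 1.4 or later and uses an in-memory SQLite URL only so that it runs without a database server; the real connection URL, the SessionLocal name, and the body of get_db_session() are assumptions, since database.py is not shown here.

```python
from contextlib import contextmanager

from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker

# pool_recycle=1800 makes the pool replace connections older than 30 minutes
# the next time they are checked out. The SQLite URL is a placeholder.
engine = create_engine("sqlite://", pool_recycle=1800)
SessionLocal = sessionmaker(bind=engine)  # hypothetical name


@contextmanager
def get_db_session():
    """Read-only session scope: no commit, just a guaranteed close."""
    session = SessionLocal()
    try:
        yield session
    finally:
        session.close()


with get_db_session() as session:
    value = session.execute(text("SELECT 1")).scalar()
print(value)
```

Because nothing is written inside the scope, there is nothing to commit, and closing the session returns the connection to the pool.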

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tudor Sitaru
2026-04-15 22:45:46 +01:00
parent f6b9d650f8
commit f05bbba613
7 changed files with 106 additions and 77 deletions
+53 -1
@@ -242,6 +242,8 @@ def load_school_data_as_dataframe() -> pd.DataFrame:
# Cache for DataFrame
_df_cache: Optional[pd.DataFrame] = None
# Pre-computed latest-year snapshot (one row per school, with prev-year trend columns)
_df_latest_cache: Optional[pd.DataFrame] = None
def load_school_data() -> pd.DataFrame:
@@ -260,10 +262,60 @@ def load_school_data() -> pd.DataFrame:
    return _df_cache
def load_latest_school_data() -> pd.DataFrame:
    """Return a cached one-row-per-school DataFrame at the latest available year.

    The expensive groupby / merge / prev-year trend computation runs once at
    startup (or after a cache clear) rather than on every search request.
    Per-request filters (phase, gender, LA …) should be applied to a copy of
    the returned DataFrame; they must NOT modify the cached object.
    """
    global _df_latest_cache
    if _df_latest_cache is not None:
        return _df_latest_cache

    df = load_school_data()
    if df.empty:
        return df

    # Schools that have no performance rows (PRUs, new schools, etc.)
    df_no_perf = df[df["year"].isna()].drop_duplicates(subset=["urn"])
    df_with_perf = df[df["year"].notna()]

    # Reduce to the latest year per school
    latest_year = df_with_perf.groupby("urn")["year"].max().reset_index()
    df_latest = df_with_perf.merge(latest_year, on=["urn", "year"])

    # Attach previous-year metrics for trend arrows (second-latest year per school)
    df_sorted = df_with_perf.sort_values(["urn", "year"], ascending=[True, False])
    df_prev = df_sorted.groupby("urn").nth(1).reset_index()
    if not df_prev.empty and "rwm_expected_pct" in df_prev.columns:
        prev_rwm = df_prev[["urn", "rwm_expected_pct"]].rename(
            columns={"rwm_expected_pct": "prev_rwm_expected_pct"}
        )
        if "attainment_8_score" in df_prev.columns:
            prev_rwm = prev_rwm.merge(
                df_prev[["urn", "attainment_8_score"]].rename(
                    columns={"attainment_8_score": "prev_attainment_8_score"}
                ),
                on="urn",
                how="outer",
            )
        df_latest = df_latest.merge(prev_rwm, on="urn", how="left")

    # Merge back schools with no performance data
    df_latest = pd.concat([df_latest, df_no_perf], ignore_index=True)
    print(f"Latest-snapshot cache built: {len(df_latest)} schools")
    _df_latest_cache = df_latest
    return _df_latest_cache

def clear_cache():
    """Clear all caches."""
    global _df_cache, _df_latest_cache
    _df_cache = None
    _df_latest_cache = None
# =============================================================================