Incremental Snapshot Downloader Design
Date: 2026-05-15 Status: Approved
Problem
The snapshot downloader currently produces a new HTML file on every daily run, named snapshots_{date_from}_to_{date_to}.html. Because date_to changes each day, files accumulate with near-identical content. The user wants a single file updated in place each day.
Approach
Incremental fetch with a JSON data cache and a state file. On each run, only new snapshots are fetched from the API; they are merged into a persistent cache of all snapshot objects; the single output HTML is re-rendered from the full cache.
Data Files
All written to output_dir:
| File | Purpose |
|---|---|
| `snapshots_cache.json` | Array of every snapshot object fetched so far |
| `last_run.json` | `{"last_date_to": "YYYY-MM-DD"}`, the upper bound of the last successful fetch |
| `snapshots.html` | Single output file, fixed name, overwritten each run |
Run Logic
1. Read `last_run.json` and use `last_date_to` as `date_from` for this run. Fall back to the `date_from` argument (from config) if no state file exists yet (first run).
2. Load `snapshots_cache.json` if it exists to get the existing snapshot list. Use an empty list if it is absent.
3. Fetch new snapshots from the API with `date_from` (from step 1) and `date_to = today`.
4. Merge: append new snapshots to the cache list, deduplicating by snapshot `id`. Deduplication is needed because each run starts at the previous run's `date_to`, so the API may return the boundary date's snapshots again.
5. Save the updated `snapshots_cache.json`.
6. Re-render `snapshots.html` from the full merged list (sorted newest-first, existing logic unchanged).
7. Write `last_run.json` with today's date.
On fetch failure: steps 5–7 are skipped. The cache and state file remain unchanged, so the next run retries from the same date_from.
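The flow above, including the failure rule, can be sketched as follows. This is a minimal illustration under stated assumptions, not the final implementation: `run_once`, `merge_snapshots`, and the injected `fetch` and `render` callables are hypothetical names, every snapshot is assumed to be a dict with an `id` key, and any exception raised by `fetch` is treated as a fetch failure.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def merge_snapshots(cached: List[Dict], fetched: List[Dict]) -> List[Dict]:
    """Append fetched snapshots to the cache, skipping ids already present."""
    seen = {snap["id"] for snap in cached}
    merged = list(cached)
    for snap in fetched:
        if snap["id"] not in seen:
            seen.add(snap["id"])
            merged.append(snap)
    return merged


def run_once(
    output_dir: Path,
    fallback_date_from: str,
    today: str,
    fetch: Callable[[str, str], List[Dict]],   # hypothetical API wrapper
    render: Callable[[List[Dict]], str],       # hypothetical HTML renderer
) -> bool:
    """Run the 7-step flow once; return False if the fetch failed."""
    cache_path = output_dir / "snapshots_cache.json"
    state_path = output_dir / "last_run.json"

    # Steps 1-2: read state and cache, with first-run fallbacks.
    date_from = fallback_date_from
    if state_path.exists():
        date_from = json.loads(state_path.read_text())["last_date_to"]
    cached = json.loads(cache_path.read_text()) if cache_path.exists() else []

    # Step 3: fetch. On failure nothing has been written yet, so the
    # next run retries from the same date_from.
    try:
        fetched = fetch(date_from, today)
    except Exception:  # assumption: any exception counts as a fetch failure
        return False

    # Steps 4-7: merge, save cache, re-render, and only then advance the state.
    merged = merge_snapshots(cached, fetched)
    cache_path.write_text(json.dumps(merged))
    (output_dir / "snapshots.html").write_text(render(merged))
    state_path.write_text(json.dumps({"last_date_to": today}))
    return True
```

Writing `last_run.json` last means a crash between steps 5 and 7 leaves the old state in place; the next run then refetches the same range, and deduplication by `id` absorbs the overlap.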
Code Changes
All changes are in src/snapshot_downloader.py. ConfigSnapshotDownloader is unchanged.
New methods on SnapshotDownloader
- `load_snapshot_cache() -> List[Dict]`: reads `snapshots_cache.json`; returns `[]` if absent or malformed
- `save_snapshot_cache(snapshots: List[Dict])`: writes `snapshots_cache.json`
- `load_last_run_date() -> Optional[str]`: reads `last_run.json`; returns `None` if absent
- `save_last_run_date(date: str)`: writes `last_run.json`
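One possible shape for these four methods is sketched below. `SnapshotDownloaderSketch` is a hypothetical stand-in for the real class, and the `output_dir` attribute name is an assumption. The design does not say what `load_last_run_date()` should do with a malformed state file, so this sketch lets the error propagate in that case.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


class SnapshotDownloaderSketch:
    """Stand-in for SnapshotDownloader; only the four new methods are shown."""

    def __init__(self, output_dir: str) -> None:
        self.output_dir = Path(output_dir)  # assumed attribute name

    def load_snapshot_cache(self) -> List[Dict]:
        path = self.output_dir / "snapshots_cache.json"
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            return []  # absent or malformed: start from an empty cache
        # A valid JSON document that is not an array is also treated as malformed.
        return data if isinstance(data, list) else []

    def save_snapshot_cache(self, snapshots: List[Dict]) -> None:
        path = self.output_dir / "snapshots_cache.json"
        path.write_text(json.dumps(snapshots, indent=2), encoding="utf-8")

    def load_last_run_date(self) -> Optional[str]:
        path = self.output_dir / "last_run.json"
        if not path.exists():
            return None  # first run: caller falls back to the configured date_from
        return json.loads(path.read_text(encoding="utf-8"))["last_date_to"]

    def save_last_run_date(self, date: str) -> None:
        path = self.output_dir / "last_run.json"
        path.write_text(json.dumps({"last_date_to": date}), encoding="utf-8")
```

Returning `[]` for a malformed cache keeps the run going, but combined with an existing `last_run.json` it would rebuild the cache only from the last run date onward; whether that trade-off is acceptable is a decision for the implementation.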
Modified methods
download_snapshots(type_ids, date_from, date_to, max_pages)
date_from becomes the "initial history start" fallback (used only when last_run.json does not exist). date_to is always set to today's date, regardless of what is passed. The method orchestrates the 7-step incremental flow above.
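The new argument semantics can be illustrated with a small helper. `resolve_date_range` is a hypothetical name, and passing `today` in as a parameter is an assumption made here so the behavior is easy to test; the real method may simply call `date.today()`.

```python
from datetime import date
from typing import Optional, Tuple


def resolve_date_range(
    last_date_to: Optional[str],   # value from last_run.json, or None on first run
    date_from_arg: str,            # configured "initial history start"
    date_to_arg: Optional[str],    # accepted for compatibility, then ignored
    today: date,
) -> Tuple[str, str]:
    """Return the (date_from, date_to) pair actually sent to the API."""
    date_from = last_date_to if last_date_to is not None else date_from_arg
    date_to = today.isoformat()  # always today, regardless of date_to_arg
    return date_from, date_to
```

For example, on a first run with no state file the configured start date is used, while on later runs the stored `last_date_to` takes precedence over it.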
generate_html_file(snapshots, date_from, date_to)
Output filename changes from snapshots_{date_from}_to_{date_to}.html to the fixed name snapshots.html.
Out of Scope
- Detecting and applying updates to existing snapshot objects (notes edits, etc.) — snapshots are treated as append-only
- Pruning the cache (old entries are kept indefinitely)
- Changes to `ConfigSnapshotDownloader`, the web server, or any other module