
Incremental Snapshot Downloader Design

Date: 2026-05-15
Status: Approved

Problem

The snapshot downloader currently produces a new HTML file on every daily run, named snapshots_{date_from}_to_{date_to}.html. Because date_to changes each day, files accumulate with near-identical content. The user wants a single file updated in place each day.

Approach

Incremental fetch with a JSON data cache and a state file. On each run, only new snapshots are fetched from the API; they are merged into a persistent cache of all snapshot objects; the single output HTML is re-rendered from the full cache.

Data Files

All written to output_dir:

File                 | Purpose
snapshots_cache.json | Array of every snapshot object fetched so far
last_run.json        | {"last_date_to": "YYYY-MM-DD"}, the upper bound of the last successful fetch
snapshots.html       | Single output file, fixed name, overwritten each run

Run Logic

  1. Read last_run.json → use last_date_to as date_from for this run. Fall back to the date_from argument (from config) if no state file exists yet (first run).
  2. Load snapshots_cache.json if it exists → existing snapshot list. Empty list if absent.
  3. Fetch new snapshots from the API with date_from (from step 1) and date_to = today.
  4. Merge: append new snapshots to the cache list, deduplicating by snapshot id. Deduplication is needed because each run reuses the previous last_date_to as its date_from, so the API may return snapshots from that boundary date again.
  5. Save the updated snapshots_cache.json.
  6. Re-render snapshots.html from the full merged list (sorted newest-first, existing logic unchanged).
  7. Write last_run.json with today's date as last_date_to.

On fetch failure: steps 5-7 are skipped. The cache and state file remain unchanged, so the next run retries from the same date_from.
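
The run logic above can be sketched as a single function. This is a minimal illustration, not the real download_snapshots: the names run_incremental, merge_snapshots, fetch, and render are hypothetical, the snapshot key "id" is an assumption about the API payload, and any exception raised by fetch is treated as a fetch failure. Handling of malformed files is left out to keep the flow visible.

```python
import json
from datetime import date
from pathlib import Path
from typing import Callable, Dict, List, Optional


def merge_snapshots(existing: List[Dict], new: List[Dict]) -> List[Dict]:
    """Append new snapshots, skipping ids already in the cache (step 4)."""
    seen = {snap["id"] for snap in existing}  # "id" key is an assumption
    merged = list(existing)
    for snap in new:
        if snap["id"] not in seen:
            seen.add(snap["id"])
            merged.append(snap)
    return merged


def run_incremental(
    output_dir: Path,
    fallback_date_from: str,
    fetch: Callable[[str, str], List[Dict]],  # hypothetical API wrapper
    render: Callable[[List[Dict]], str],      # hypothetical HTML renderer
    today: Optional[str] = None,
) -> bool:
    """Run one incremental pass; return False if the fetch failed."""
    today = today or date.today().isoformat()
    state_path = output_dir / "last_run.json"
    cache_path = output_dir / "snapshots_cache.json"

    # Step 1: resume from the last successful upper bound, if any.
    date_from = fallback_date_from
    if state_path.exists():
        date_from = json.loads(state_path.read_text(encoding="utf-8"))["last_date_to"]

    # Step 2: load the existing cache, or start empty.
    cached: List[Dict] = []
    if cache_path.exists():
        cached = json.loads(cache_path.read_text(encoding="utf-8"))

    # Step 3: fetch only the new window.
    try:
        new = fetch(date_from, today)
    except Exception:
        # Steps 5-7 are skipped, so cache and state stay untouched.
        return False

    # Step 4: merge with deduplication by id.
    merged = merge_snapshots(cached, new)

    # Step 5: persist the cache.
    cache_path.write_text(json.dumps(merged), encoding="utf-8")

    # Step 6: re-render the single output file from the full list.
    (output_dir / "snapshots.html").write_text(render(merged), encoding="utf-8")

    # Step 7: record today's date as the new upper bound.
    state_path.write_text(json.dumps({"last_date_to": today}), encoding="utf-8")
    return True
```

Writing last_run.json last means a crash between steps 5 and 7 only causes the next run to re-fetch a window it has already cached, which the deduplication in step 4 absorbs.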

Code Changes

All changes are in src/snapshot_downloader.py. ConfigSnapshotDownloader is unchanged.

New methods on SnapshotDownloader

  • load_snapshot_cache() -> List[Dict] — reads snapshots_cache.json; returns [] if absent or malformed
  • save_snapshot_cache(snapshots: List[Dict]) — writes snapshots_cache.json
  • load_last_run_date() -> Optional[str] — reads last_run.json; returns None if absent
  • save_last_run_date(date: str) — writes last_run.json
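
One possible shape for these four methods is sketched below. SnapshotDownloaderSketch is a hypothetical stand-in that holds only output_dir; the real SnapshotDownloader has other state not shown here. Treating a cache file that parses but is not a list as malformed is an assumption, and the behaviour of load_last_run_date on a malformed state file is not specified above, so the sketch lets that error propagate.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


class SnapshotDownloaderSketch:
    """Hypothetical stand-in for SnapshotDownloader (cache and state only)."""

    def __init__(self, output_dir: str) -> None:
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

    def load_snapshot_cache(self) -> List[Dict]:
        """Return the cached snapshots, or [] if the file is absent or malformed."""
        path = self.output_dir / "snapshots_cache.json"
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # OSError covers a missing file; ValueError covers invalid JSON.
            return []
        # Assumption: anything other than a JSON array counts as malformed.
        return data if isinstance(data, list) else []

    def save_snapshot_cache(self, snapshots: List[Dict]) -> None:
        """Overwrite snapshots_cache.json with the full snapshot list."""
        path = self.output_dir / "snapshots_cache.json"
        path.write_text(json.dumps(snapshots, ensure_ascii=False), encoding="utf-8")

    def load_last_run_date(self) -> Optional[str]:
        """Return last_date_to, or None if last_run.json does not exist."""
        path = self.output_dir / "last_run.json"
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))["last_date_to"]

    def save_last_run_date(self, date: str) -> None:
        """Record the upper bound of the last successful fetch."""
        path = self.output_dir / "last_run.json"
        path.write_text(json.dumps({"last_date_to": date}), encoding="utf-8")
```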

Modified methods

download_snapshots(type_ids, date_from, date_to, max_pages)

date_from becomes the "initial history start" fallback (used only when last_run.json does not exist). date_to is always set to today's date, regardless of what is passed. The method orchestrates the 7-step incremental flow above.
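
The new meaning of the two date parameters can be isolated in a small pure function. This is an illustrative sketch only: resolve_date_range and its parameter names are hypothetical, and the today argument exists just so the behaviour can be checked without depending on the system clock.

```python
from datetime import date
from typing import Optional, Tuple


def resolve_date_range(
    last_run_date: Optional[str],
    config_date_from: str,
    requested_date_to: Optional[str] = None,
    today: Optional[date] = None,
) -> Tuple[str, str]:
    """Return the (date_from, date_to) pair actually sent to the API."""
    # date_from: the saved state wins; the configured value is only the
    # "initial history start" for the very first run.
    date_from = last_run_date if last_run_date is not None else config_date_from
    # date_to: always today; requested_date_to is deliberately ignored.
    date_to = (today or date.today()).isoformat()
    return date_from, date_to
```

For example, with a saved state of 2026-05-14 and a run on 2026-05-15, the function returns ("2026-05-14", "2026-05-15") whatever date_to the caller passed.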

generate_html_file(snapshots, date_from, date_to)

Output filename changes from snapshots_{date_from}_to_{date_to}.html to the fixed name snapshots.html.

Out of Scope

  • Detecting and applying updates to existing snapshot objects (notes edits, etc.) — snapshots are treated as append-only
  • Pruning the cache (old entries are kept indefinitely)
  • Changes to ConfigSnapshotDownloader, the web server, or any other module