# Incremental Snapshot Downloader Design
**Date:** 2026-05-15
**Status:** Approved
## Problem
The snapshot downloader currently produces a new HTML file on every daily run, named `snapshots_{date_from}_to_{date_to}.html`. Because `date_to` changes each day, files accumulate with near-identical content. The user wants a single file updated in place each day.
## Approach
Incremental fetch with a JSON data cache and a state file. On each run, only new snapshots are fetched from the API; they are merged into a persistent cache of all snapshot objects; the single output HTML is re-rendered from the full cache.
## Data Files
All written to `output_dir`:
| File | Purpose |
|---|---|
| `snapshots_cache.json` | Array of every snapshot object fetched so far |
| `last_run.json` | `{"last_date_to": "YYYY-MM-DD"}` — upper bound of the last successful fetch |
| `snapshots.html` | Single output file, fixed name, overwritten each run |
## Run Logic
1. Read `last_run.json` → use `last_date_to` as `date_from` for this run. Fall back to the `date_from` argument (from config) if no state file exists yet (first run).
2. Load `snapshots_cache.json` if it exists → existing snapshot list. Empty list if absent.
3. Fetch new snapshots from the API with `date_from` (from step 1) and `date_to = today`.
4. Merge: append new snapshots to the cache list, deduplicating by snapshot `id`. Deduplication is needed because each run starts from the previous run's `last_date_to`, so the API may return snapshots from that boundary date again.
5. Save the updated `snapshots_cache.json`.
6. Re-render `snapshots.html` from the full merged list (sorted newest-first, existing logic unchanged).
7. Write `last_run.json` with today's date.
**On fetch failure:** steps 5 through 7 are skipped. The cache and state file remain unchanged, so the next run retries from the same `date_from`.
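The merge and failure handling in steps 3 through 7 can be sketched as below. This is a minimal illustration, not the actual implementation: `merge_snapshots`, `run_incremental`, and the `fetch` callable are hypothetical names, and the sketch assumes every snapshot object is a dict with an `id` key and that a failed fetch raises an exception.

```python
from typing import Callable, Dict, List, Optional, Tuple


def merge_snapshots(cached: List[Dict], fetched: List[Dict]) -> List[Dict]:
    """Append fetched snapshots to the cached list, skipping ids already seen."""
    seen = {snap["id"] for snap in cached}
    merged = list(cached)  # copy so the caller's list is not mutated
    for snap in fetched:
        if snap["id"] not in seen:
            seen.add(snap["id"])
            merged.append(snap)
    return merged


def run_incremental(
    fetch: Callable[[str, str], List[Dict]],  # hypothetical: fetch(date_from, date_to)
    cached: List[Dict],
    date_from: str,
    today: str,
) -> Tuple[List[Dict], Optional[str]]:
    """Return (snapshots to render, new last_date_to or None on failure)."""
    try:
        fetched = fetch(date_from, today)  # step 3
    except Exception:
        # Fetch failed: nothing should be persisted, so the next run
        # retries from the same date_from.
        return cached, None
    merged = merge_snapshots(cached, fetched)  # step 4
    # The caller would now save the cache (5), render the HTML (6),
    # and write last_run.json with `today` (7).
    return merged, today
```

Returning `None` as the new state on failure keeps the "skip steps 5 through 7" rule in one place: the caller persists anything only when it receives a date back.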
## Code Changes
All changes are in `src/snapshot_downloader.py`. `ConfigSnapshotDownloader` is unchanged.
### New methods on `SnapshotDownloader`
- `load_snapshot_cache() -> List[Dict]` — reads `snapshots_cache.json`; returns `[]` if absent or malformed
- `save_snapshot_cache(snapshots: List[Dict])` — writes `snapshots_cache.json`
- `load_last_run_date() -> Optional[str]` — reads `last_run.json`; returns `None` if absent
- `save_last_run_date(date: str)` — writes `last_run.json`
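As a rough sketch of how these four helpers might behave, the functions below are written as module-level functions that take the output directory explicitly. That parameter is an assumption made to keep the example self-contained; the spec defines them as methods on `SnapshotDownloader`, which presumably already knows its `output_dir`.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional

CACHE_NAME = "snapshots_cache.json"
STATE_NAME = "last_run.json"


def load_snapshot_cache(output_dir: Path) -> List[Dict]:
    """Return the cached snapshot list, or [] if the file is absent or malformed."""
    try:
        data = json.loads((output_dir / CACHE_NAME).read_text(encoding="utf-8"))
    except (FileNotFoundError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError
        return []
    # Treat any non-array payload as malformed as well
    return data if isinstance(data, list) else []


def save_snapshot_cache(output_dir: Path, snapshots: List[Dict]) -> None:
    """Write the full snapshot list as a JSON array."""
    text = json.dumps(snapshots, ensure_ascii=False, indent=2)
    (output_dir / CACHE_NAME).write_text(text, encoding="utf-8")


def load_last_run_date(output_dir: Path) -> Optional[str]:
    """Return last_date_to from the state file, or None if the file is absent."""
    path = output_dir / STATE_NAME
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))["last_date_to"]


def save_last_run_date(output_dir: Path, date: str) -> None:
    """Record the upper bound of the last successful fetch."""
    text = json.dumps({"last_date_to": date})
    (output_dir / STATE_NAME).write_text(text, encoding="utf-8")
```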
### Modified methods
**`download_snapshots(type_ids, date_from, date_to, max_pages)`**
`date_from` becomes the "initial history start" fallback (used only when `last_run.json` does not exist). `date_to` is always set to today's date, regardless of what is passed. The method orchestrates the 7-step incremental flow above.
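The date handling described here could look roughly like the following. `resolve_date_range` is a hypothetical helper introduced only for illustration; the `today` parameter is an assumption added so the behaviour can be exercised without depending on the system clock.

```python
from datetime import date
from typing import Optional, Tuple


def resolve_date_range(
    date_from: str,
    date_to: Optional[str],
    last_run_date: Optional[str],
    today: Optional[date] = None,
) -> Tuple[str, str]:
    """Pick the effective (date_from, date_to) for an incremental run."""
    today = today or date.today()
    # The saved state wins; the argument is only the initial-history fallback.
    effective_from = last_run_date or date_from
    # The date_to argument is accepted for signature compatibility but ignored.
    return effective_from, today.isoformat()
```

For example, with a state file holding `2026-05-14`, a call on 2026-05-15 would fetch from `2026-05-14` to `2026-05-15` whatever `date_from` and `date_to` the caller supplied.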
**`generate_html_file(snapshots, date_from, date_to)`**
Output filename changes from `snapshots_{date_from}_to_{date_to}.html` to the fixed name `snapshots.html`.
## Out of Scope
- Detecting and applying updates to existing snapshot objects (notes edits, etc.) — snapshots are treated as append-only
- Pruning the cache (old entries are kept indefinitely)
- Changes to `ConfigSnapshotDownloader`, the web server, or any other module