Add design spec for incremental snapshot downloader
@@ -0,0 +1,61 @@
# Incremental Snapshot Downloader Design

**Date:** 2026-05-15
**Status:** Approved

## Problem

The snapshot downloader currently produces a new HTML file on every daily run, named `snapshots_{date_from}_to_{date_to}.html`. Because `date_to` changes each day, files accumulate with near-identical content. The user wants a single file updated in place each day.

## Approach

Incremental fetch with a JSON data cache and a state file. On each run, only new snapshots are fetched from the API; they are merged into a persistent cache of all snapshot objects; the single output HTML is re-rendered from the full cache.

## Data Files

All files are written to `output_dir`:

| File | Purpose |
|---|---|
| `snapshots_cache.json` | Array of every snapshot object fetched so far |
| `last_run.json` | `{"last_date_to": "YYYY-MM-DD"}`; upper bound of the last successful fetch |
| `snapshots.html` | Single output file, fixed name, overwritten each run |

## Run Logic

1. Read `last_run.json` → use `last_date_to` as `date_from` for this run. Fall back to the `date_from` argument (from config) if no state file exists yet (first run).
2. Load `snapshots_cache.json` if it exists → existing snapshot list. Empty list if absent.
3. Fetch new snapshots from the API with `date_from` (from step 1) and `date_to = today`.
4. Merge: append new snapshots to the cache list, deduplicating by snapshot `id`. Deduplication is needed because each run starts at the previous run's `date_to`, so the API may return the boundary date's snapshots again.
5. Save the updated `snapshots_cache.json`.
6. Re-render `snapshots.html` from the full merged list (sorted newest-first, existing logic unchanged).
7. Write `last_run.json` with today's date as `last_date_to`.

**On fetch failure:** steps 5–7 are skipped. The cache and state file remain unchanged, so the next run retries from the same `date_from`.
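
The seven steps and the failure rule can be sketched roughly as follows. This is a minimal illustration, not the actual `SnapshotDownloader` code: `run_incremental`, `merge_snapshots`, and the injected `fetch` and `render` callables are hypothetical names, and malformed-file handling is left out for brevity. It assumes each snapshot is a dict with an `id` key and that a failed fetch raises an exception.

```python
import json
from datetime import date
from pathlib import Path
from typing import Callable, Dict, List


def merge_snapshots(cached: List[Dict], fetched: List[Dict]) -> List[Dict]:
    """Append fetched snapshots to the cache, skipping ids already present."""
    seen = {snap["id"] for snap in cached}
    merged = list(cached)
    for snap in fetched:
        if snap["id"] not in seen:
            seen.add(snap["id"])
            merged.append(snap)
    return merged


def run_incremental(
    output_dir: Path,
    config_date_from: str,
    fetch: Callable[[str, str], List[Dict]],  # hypothetical API wrapper
    render: Callable[[List[Dict]], str],      # hypothetical HTML renderer
) -> List[Dict]:
    cache_path = output_dir / "snapshots_cache.json"
    state_path = output_dir / "last_run.json"

    # Step 1: resume from the last successful run, else use the config value.
    if state_path.exists():
        date_from = json.loads(state_path.read_text())["last_date_to"]
    else:
        date_from = config_date_from

    # Step 2: load the existing cache, or start empty.
    cached = json.loads(cache_path.read_text()) if cache_path.exists() else []

    # Step 3: if fetch raises, nothing below runs, so no file is modified.
    today = date.today().isoformat()
    fetched = fetch(date_from, today)

    # Step 4: merge, deduplicating by id.
    merged = merge_snapshots(cached, fetched)

    # Steps 5-7: persist the cache, re-render, then record the new state.
    cache_path.write_text(json.dumps(merged))
    (output_dir / "snapshots.html").write_text(render(merged))
    state_path.write_text(json.dumps({"last_date_to": today}))
    return merged
```

Writing `last_run.json` last means a crash between steps 5 and 7 only causes the next run to refetch a range it already has, which the `id` deduplication absorbs.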

## Code Changes

All changes are in `src/snapshot_downloader.py`. `ConfigSnapshotDownloader` is unchanged.

### New methods on `SnapshotDownloader`

- `load_snapshot_cache() -> List[Dict]`: reads `snapshots_cache.json`; returns `[]` if absent or malformed
- `save_snapshot_cache(snapshots: List[Dict])`: writes `snapshots_cache.json`
- `load_last_run_date() -> Optional[str]`: reads `last_run.json`; returns `None` if absent
- `save_last_run_date(date: str)`: writes `last_run.json`
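
One possible shape for these four methods is sketched below. `SnapshotStore` is a hypothetical stand-in class used only to keep the example self-contained; in the spec the methods live on `SnapshotDownloader`. Treating a non-list JSON value as "malformed" is an assumption, as is the use of UTF-8 and indented output.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


class SnapshotStore:
    """Hypothetical stand-in for the persistence part of SnapshotDownloader."""

    def __init__(self, output_dir: str) -> None:
        self.output_dir = Path(output_dir)

    def load_snapshot_cache(self) -> List[Dict]:
        path = self.output_dir / "snapshots_cache.json"
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (FileNotFoundError, json.JSONDecodeError):
            return []  # absent or unparseable
        # Assumption: anything other than a JSON array counts as malformed.
        return data if isinstance(data, list) else []

    def save_snapshot_cache(self, snapshots: List[Dict]) -> None:
        path = self.output_dir / "snapshots_cache.json"
        path.write_text(json.dumps(snapshots, indent=2), encoding="utf-8")

    def load_last_run_date(self) -> Optional[str]:
        path = self.output_dir / "last_run.json"
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))["last_date_to"]

    def save_last_run_date(self, date: str) -> None:
        path = self.output_dir / "last_run.json"
        path.write_text(json.dumps({"last_date_to": date}), encoding="utf-8")
```

Note that, following the list above, only the cache loader tolerates a malformed file; this sketch lets a corrupt `last_run.json` raise rather than guessing at a fallback.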

### Modified methods

**`download_snapshots(type_ids, date_from, date_to, max_pages)`**

`date_from` becomes the "initial history start" fallback (used only when `last_run.json` does not exist). `date_to` is always set to today's date, regardless of the value passed in. The method orchestrates the seven-step incremental flow above.
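
The new argument semantics can be illustrated with a small helper. `resolve_date_range` and its `today` parameter (added here only to make the function testable) are hypothetical and not part of the spec; the sketch assumes dates are ISO `YYYY-MM-DD` strings, as in `last_run.json`.

```python
from datetime import date
from typing import Optional, Tuple


def resolve_date_range(
    last_run_date: Optional[str],
    date_from: str,
    date_to: Optional[str] = None,
    today: Optional[date] = None,
) -> Tuple[str, str]:
    """Return the (date_from, date_to) pair actually sent to the API."""
    # The saved state wins; the argument is only the first-run fallback.
    effective_from = last_run_date if last_run_date is not None else date_from
    # date_to is accepted for signature compatibility but deliberately ignored.
    effective_to = (today or date.today()).isoformat()
    return effective_from, effective_to
```

For example, with a saved state of `"2026-05-14"` and a run on 2026-05-15, the helper returns `("2026-05-14", "2026-05-15")` whatever `date_from` and `date_to` the caller supplies.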

**`generate_html_file(snapshots, date_from, date_to)`**

Output filename changes from `snapshots_{date_from}_to_{date_to}.html` to the fixed name `snapshots.html`.

## Out of Scope

- Detecting and applying updates to existing snapshot objects (edits to notes, etc.); snapshots are treated as append-only
- Pruning the cache (old entries are kept indefinitely)
- Changes to `ConfigSnapshotDownloader`, the web server, or any other module