adding claude md initial file

fix: remove double authenticate, remove debug print, fix display date
test: verify state file not updated when HTML generation fails
2026-05-16 20:34:42 +01:00 · 2026-05-16 20:29:22 +01:00 · 2026-05-16 20:26:10 +01:00 · 2026-05-16 20:22:54 +01:00 · 2026-05-16 20:17:30 +01:00 · 2026-05-16 20:15:24 +01:00
5 changed files with 797 additions and 47 deletions
@@ -0,0 +1,55 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Commands
+
+**Install dependencies:**
+```bash
+pip install -r requirements.txt
+```
+
+**Run a downloader manually:**
+```bash
+python3 src/config_downloader.py --config config/parentzone_config.json
+python3 src/config_snapshot_downloader.py --config config/snapshot_config.json
+```
+
+**Start the web server:**
+```bash
+python3 src/webserver.py --host 0.0.0.0 --port 8080 --snapshots-dir ./data/snapshots
+```
+
+**Run a test script:**
+```bash
+PYTHONPATH=. python3 tests/test_snapshot_downloader.py
+PYTHONPATH=. python3 tests/test_asset_tracking.py
+```
+
+**Docker:**
+```bash
+docker-compose up --build
+```
+
+## Architecture
+
+All source code lives in `src/`. The module hierarchy has two layers:
+
+**Core layer** — the main async classes that talk directly to the ParentZone API:
+- `ImageDownloader` — fetches asset lists from a gallery endpoint, downloads media files concurrently (semaphore-controlled), and preserves original `updated` timestamps on disk
+- `SnapshotDownloader` — fetches daily event snapshots with cursor-based pagination and renders them to self-contained HTML files (inline assets)
+- `AuthManager` — handles both auth modes: API key (`x-api-key` header for list, `key` param for download URLs) and login-based (POST to `/v1/auth/login`, session token in subsequent requests)
+- `AssetTracker` — persists a JSON sidecar file per output directory tracking asset IDs and metadata to skip already-downloaded files
+- `SnapshotsWebServer` — aiohttp web server that serves the generated HTML reports with directory listing
+
+**Config layer** — thin wrappers that load JSON config files and delegate to the core layer:
+- `ConfigImageDownloader` — loads `parentzone_config.json`, calls `ImageDownloader`
+- `ConfigSnapshotDownloader` — loads `snapshot_config.json`, calls `SnapshotDownloader`
+
+These config-layer modules are the actual entry points invoked by `scripts/scheduler.sh` in the Docker container.
+
+**Deployment flow:** The Docker container runs `scripts/startup.sh`, which starts `cron` (scheduled via `scripts/crontab`) and the web server. The cron job calls `scripts/scheduler.sh` nightly, and that script runs both config-layer downloaders.
+
+**Tests** are standalone integration-style scripts (not pytest) in `tests/`, each with its own mock server or test runner class. Run them with `PYTHONPATH=.` so imports resolve from the repo root.
+
+**Config files** in `config/` are gitignored (contain credentials). Use `config_example.json` and `snapshot_config_example.json` as templates.
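
Note on the "semaphore-controlled" concurrency mentioned for `ImageDownloader` above: the pattern can be sketched with the standard library alone. This is only an illustration of the general technique, not the repository's implementation; `fetch_one`, `download_all`, and `max_concurrent` are hypothetical names, and `asyncio.sleep` stands in for the real network call.

```python
import asyncio
from typing import List, Tuple


async def fetch_one(asset_id: int, semaphore: asyncio.Semaphore, counters: List[int]) -> int:
    """Hypothetical stand-in for downloading one asset."""
    async with semaphore:  # at most max_concurrent coroutines pass this point at once
        counters[0] += 1
        counters[1] = max(counters[1], counters[0])  # record peak concurrency
        await asyncio.sleep(0.01)  # placeholder for the real HTTP request
        counters[0] -= 1
    return asset_id


async def download_all(asset_ids: List[int], max_concurrent: int = 3) -> Tuple[List[int], int]:
    """Schedule every download but cap how many run at the same time."""
    semaphore = asyncio.Semaphore(max_concurrent)
    counters = [0, 0]  # [currently running, peak observed]
    results = await asyncio.gather(
        *(fetch_one(asset_id, semaphore, counters) for asset_id in asset_ids)
    )
    return list(results), counters[1]
```

`asyncio.gather` returns results in the order the coroutines were passed in, so the caller can pair each result with its asset ID even though the downloads finish out of order.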
@@ -0,0 +1,421 @@
+# Incremental Snapshot Downloader Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Replace the date-stamped multi-file output with a single `snapshots.html` updated incrementally each daily run.
+
+**Architecture:** Add cache/state I/O methods to `SnapshotDownloader`, change the output filename to a fixed `snapshots.html`, and rewrite `download_snapshots` to load an existing JSON cache, fetch only new snapshots since the last run, merge and deduplicate by `id`, then re-render the full HTML from the merged list.
+
+**Tech Stack:** Python 3.13, aiohttp, aiofiles, pytest, unittest.mock
+
+---
+
+## File Map
+
+| File | Change |
+|---|---|
+| `src/snapshot_downloader.py` | Add 4 I/O methods; modify `generate_html_file` and `download_snapshots` |
+| `tests/test_incremental_snapshot.py` | New — unit tests for all new and modified behaviour |
+
+---
+
+## Task 1: Cache and state file I/O methods
+
+**Files:**
+- Modify: `src/snapshot_downloader.py`
+- Create: `tests/test_incremental_snapshot.py`
+
+- [ ] **Step 1: Install pytest**
+
+```bash
+pip install pytest
+```
+
+Expected: pytest installed (confirm with `pytest --version`).
+
+- [ ] **Step 2: Create `tests/test_incremental_snapshot.py` with failing tests for all four I/O methods**
+
+```python
+import asyncio
+import json
+import pytest
+from unittest.mock import AsyncMock, patch
+
+from src.snapshot_downloader import SnapshotDownloader
+
+
+def _downloader(tmp_path):
+    return SnapshotDownloader(output_dir=str(tmp_path), api_key="test-key")
+
+
+# --- load_snapshot_cache ---
+
+def test_load_snapshot_cache_missing(tmp_path):
+    assert _downloader(tmp_path).load_snapshot_cache() == []
+
+
+def test_load_snapshot_cache_returns_data(tmp_path):
+    d = _downloader(tmp_path)
+    snapshots = [{"id": "1", "notes": "hello"}]
+    (tmp_path / "snapshots_cache.json").write_text(json.dumps(snapshots))
+    assert d.load_snapshot_cache() == snapshots
+
+
+def test_load_snapshot_cache_malformed_returns_empty(tmp_path):
+    d = _downloader(tmp_path)
+    (tmp_path / "snapshots_cache.json").write_text("not json{{{")
+    assert d.load_snapshot_cache() == []
+
+
+def test_load_snapshot_cache_non_list_returns_empty(tmp_path):
+    d = _downloader(tmp_path)
+    (tmp_path / "snapshots_cache.json").write_text('{"key": "val"}')
+    assert d.load_snapshot_cache() == []
+
+
+# --- save_snapshot_cache ---
+
+def test_save_snapshot_cache_writes_json(tmp_path):
+    d = _downloader(tmp_path)
+    snapshots = [{"id": "1"}, {"id": "2"}]
+    d.save_snapshot_cache(snapshots)
+    data = json.loads((tmp_path / "snapshots_cache.json").read_text())
+    assert data == snapshots
+
+
+# --- load_last_run_date ---
+
+def test_load_last_run_date_missing(tmp_path):
+    assert _downloader(tmp_path).load_last_run_date() is None
+
+
+def test_load_last_run_date_returns_date(tmp_path):
+    d = _downloader(tmp_path)
+    (tmp_path / "last_run.json").write_text('{"last_date_to": "2025-01-01"}')
+    assert d.load_last_run_date() == "2025-01-01"
+
+
+def test_load_last_run_date_malformed_returns_none(tmp_path):
+    d = _downloader(tmp_path)
+    (tmp_path / "last_run.json").write_text("not json")
+    assert d.load_last_run_date() is None
+
+
+# --- save_last_run_date ---
+
+def test_save_last_run_date_writes_json(tmp_path):
+    d = _downloader(tmp_path)
+    d.save_last_run_date("2025-06-01")
+    data = json.loads((tmp_path / "last_run.json").read_text())
+    assert data == {"last_date_to": "2025-06-01"}
+```
+
+- [ ] **Step 3: Run tests to confirm they fail**
+
+```bash
+PYTHONPATH=. pytest tests/test_incremental_snapshot.py -v
+```
+
+Expected: all tests FAIL with an `AttributeError` such as `'SnapshotDownloader' object has no attribute 'load_snapshot_cache'` (the attribute name varies with the method under test).
+
+- [ ] **Step 4: Add the four I/O methods to `SnapshotDownloader` in `src/snapshot_downloader.py`**
+
+Add after the `download_media_file` method (around line 540), before `generate_html_file`:
+
+```python
+def load_snapshot_cache(self) -> List[Dict[str, Any]]:
+    cache_file = self.output_dir / "snapshots_cache.json"
+    if not cache_file.exists():
+        return []
+    try:
+        with open(cache_file, "r", encoding="utf-8") as f:
+            data = json.load(f)
+        return data if isinstance(data, list) else []
+    except (json.JSONDecodeError, OSError):
+        self.logger.warning("Could not read snapshot cache; starting fresh")
+        return []
+
+def save_snapshot_cache(self, snapshots: List[Dict[str, Any]]) -> None:
+    cache_file = self.output_dir / "snapshots_cache.json"
+    with open(cache_file, "w", encoding="utf-8") as f:
+        json.dump(snapshots, f, indent=2, default=str)
+
+def load_last_run_date(self) -> Optional[str]:
+    state_file = self.output_dir / "last_run.json"
+    if not state_file.exists():
+        return None
+    try:
+        with open(state_file, "r", encoding="utf-8") as f:
+            data = json.load(f)
+        return data.get("last_date_to")
+    except (json.JSONDecodeError, OSError):
+        return None
+
+def save_last_run_date(self, date: str) -> None:
+    state_file = self.output_dir / "last_run.json"
+    with open(state_file, "w", encoding="utf-8") as f:
+        json.dump({"last_date_to": date}, f)
+```
+
+- [ ] **Step 5: Run tests to confirm they pass**
+
+```bash
+PYTHONPATH=. pytest tests/test_incremental_snapshot.py -v
+```
+
+Expected: all tests PASS.
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add src/snapshot_downloader.py tests/test_incremental_snapshot.py
+git commit -m "feat: add snapshot cache and state file I/O methods"
+```
+
+---
+
+## Task 2: Fixed output filename
+
+**Files:**
+- Modify: `src/snapshot_downloader.py:562`
+- Modify: `tests/test_incremental_snapshot.py`
+
+- [ ] **Step 1: Add a failing test for the fixed filename**
+
+Append to `tests/test_incremental_snapshot.py`:
+
+```python
+# --- generate_html_file fixed filename ---
+
+def test_generate_html_file_uses_fixed_filename(tmp_path):
+    d = _downloader(tmp_path)
+    with patch.object(d, "generate_html_template", new_callable=AsyncMock, return_value="<html></html>"):
+        result = asyncio.run(d.generate_html_file([], "2024-01-01", "2025-01-01"))
+    assert result.name == "snapshots.html"
+    assert (tmp_path / "snapshots.html").exists()
+```
+
+- [ ] **Step 2: Run to confirm it fails**
+
+```bash
+PYTHONPATH=. pytest tests/test_incremental_snapshot.py::test_generate_html_file_uses_fixed_filename -v
+```
+
+Expected: FAIL — the file is named `snapshots_2024-01-01_to_2025-01-01.html`, not `snapshots.html`.
+
+- [ ] **Step 3: Change the filename in `generate_html_file`**
+
+In `src/snapshot_downloader.py`, find (around line 562):
+
+```python
+        filename = f"snapshots_{date_from}_to_{date_to}.html"
+```
+
+Replace with:
+
+```python
+        filename = "snapshots.html"
+```
+
+- [ ] **Step 4: Run tests to confirm they pass**
+
+```bash
+PYTHONPATH=. pytest tests/test_incremental_snapshot.py -v
+```
+
+Expected: all tests PASS.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/snapshot_downloader.py tests/test_incremental_snapshot.py
+git commit -m "feat: write snapshots to fixed filename snapshots.html"
+```
+
+---
+
+## Task 3: Incremental `download_snapshots`
+
+**Files:**
+- Modify: `src/snapshot_downloader.py:976–1036`
+- Modify: `tests/test_incremental_snapshot.py`
+
+- [ ] **Step 1: Add failing tests for the incremental orchestration**
+
+Append to `tests/test_incremental_snapshot.py`:
+
+```python
+# --- incremental download_snapshots ---
+
+def _run_download(d, **kwargs):
+    """Run download_snapshots with mocked API calls."""
+    new_snapshots = kwargs.pop("new_snapshots", [])
+    mock_fetch = AsyncMock(return_value=new_snapshots)
+    with patch.object(d, "authenticate", new_callable=AsyncMock):
+        with patch.object(d, "fetch_all_snapshots", mock_fetch):
+            with patch.object(d, "generate_html_file", new_callable=AsyncMock,
+                              return_value=d.output_dir / "snapshots.html"):
+                asyncio.run(d.download_snapshots(**kwargs))
+    return mock_fetch
+
+
+def test_first_run_saves_cache_and_state(tmp_path):
+    d = _downloader(tmp_path)
+    new_snapshots = [{"id": "abc", "startTime": "2025-01-15T10:00:00Z"}]
+    _run_download(d, date_from="2024-01-01", new_snapshots=new_snapshots)
+
+    assert d.load_snapshot_cache() == new_snapshots
+    assert d.load_last_run_date() is not None
+
+
+def test_subsequent_run_uses_last_run_date_as_fetch_from(tmp_path):
+    d = _downloader(tmp_path)
+    d.save_last_run_date("2025-03-01")
+    d.save_snapshot_cache([{"id": "old", "startTime": "2025-02-01T00:00:00Z"}])
+
+    new_snapshots = [{"id": "new", "startTime": "2025-03-15T00:00:00Z"}]
+    mock_fetch = _run_download(d, date_from="2024-01-01", new_snapshots=new_snapshots)
+
+    # Third positional arg to fetch_all_snapshots is date_from (after session, type_ids)
+    assert mock_fetch.call_args.args[2] == "2025-03-01"
+
+    ids = {s["id"] for s in d.load_snapshot_cache()}
+    assert ids == {"old", "new"}
+
+
+def test_deduplication_by_id(tmp_path):
+    d = _downloader(tmp_path)
+    d.save_last_run_date("2025-01-01")
+    d.save_snapshot_cache([{"id": "dup", "startTime": "2025-01-01T00:00:00Z"}])
+
+    # API returns the boundary snapshot again plus one new one
+    new_snapshots = [
+        {"id": "dup", "startTime": "2025-01-01T00:00:00Z"},
+        {"id": "fresh", "startTime": "2025-01-02T00:00:00Z"},
+    ]
+    _run_download(d, date_from="2024-01-01", new_snapshots=new_snapshots)
+
+    cache = d.load_snapshot_cache()
+    ids = [s["id"] for s in cache]
+    assert ids.count("dup") == 1
+    assert "fresh" in ids
+
+
+def test_fetch_failure_does_not_update_state(tmp_path):
+    d = _downloader(tmp_path)
+    d.save_last_run_date("2025-01-01")
+    d.save_snapshot_cache([{"id": "existing"}])
+
+    with patch.object(d, "authenticate", new_callable=AsyncMock):
+        with patch.object(d, "fetch_all_snapshots", new_callable=AsyncMock,
+                          side_effect=Exception("network error")):
+            with pytest.raises(Exception, match="network error"):
+                asyncio.run(d.download_snapshots(date_from="2024-01-01"))
+
+    assert d.load_last_run_date() == "2025-01-01"
+    assert d.load_snapshot_cache() == [{"id": "existing"}]
+```
+
+- [ ] **Step 2: Run to confirm they fail**
+
+```bash
+PYTHONPATH=. pytest tests/test_incremental_snapshot.py::test_first_run_saves_cache_and_state tests/test_incremental_snapshot.py::test_subsequent_run_uses_last_run_date_as_fetch_from tests/test_incremental_snapshot.py::test_deduplication_by_id tests/test_incremental_snapshot.py::test_fetch_failure_does_not_update_state -v
+```
+
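+Expected: the first three FAIL (the method does not yet do incremental logic). `test_fetch_failure_does_not_update_state` may already pass, because the existing method re-raises fetch errors and never writes the cache or state file.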
+
+- [ ] **Step 3: Replace `download_snapshots` in `src/snapshot_downloader.py`**
+
+Find the entire `download_snapshots` method (lines 976–1036) and replace it with:
+
+```python
+    async def download_snapshots(
+        self,
+        type_ids: List[int] = [15],
+        date_from: str = None,
+        date_to: str = None,
+        max_pages: int = None,
+    ) -> Path:
+        """
+        Download new snapshots incrementally and regenerate snapshots.html.
+
+        date_from is used only on the first run (no last_run.json).
+        date_to is always today regardless of what is passed.
+        """
+        date_to = datetime.now().strftime("%Y-%m-%d")
+
+        # Determine fetch window start
+        last_run_date = self.load_last_run_date()
+        if last_run_date:
+            fetch_from = last_run_date
+            self.logger.info(f"Incremental run: fetching from {fetch_from}")
+        else:
+            if date_from is None:
+                date_from = (datetime.now() - timedelta(days=365)).strftime("%Y-%m-%d")
+            fetch_from = date_from
+            self.logger.info(f"First run: fetching all snapshots from {fetch_from}")
+
+        self.logger.info(f"Fetch window: {fetch_from} to {date_to}")
+
+        # Load accumulated snapshot data
+        existing_snapshots = self.load_snapshot_cache()
+        self.logger.info(f"Loaded {len(existing_snapshots)} snapshots from cache")
+
+        connector = aiohttp.TCPConnector(limit=100, limit_per_host=30)
+        timeout = aiohttp.ClientTimeout(total=30)
+
+        async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
+            # Authenticate if needed
+            await self.authenticate()
+
+            # Fetch only new snapshots
+            new_snapshots = await self.fetch_all_snapshots(
+                session, type_ids, fetch_from, date_to, max_pages
+            )
+
+        # Merge: deduplicate by id
+        existing_ids = {s.get("id") for s in existing_snapshots}
+        added = [s for s in new_snapshots if s.get("id") not in existing_ids]
+        merged = existing_snapshots + added
+        self.logger.info(f"Added {len(added)} new snapshots (total: {len(merged)})")
+
+        if not merged:
+            self.logger.warning("No snapshots found")
+            return None
+
+        # Persist updated cache and state
+        self.save_snapshot_cache(merged)
+        html_file = await self.generate_html_file(merged, date_from or fetch_from, date_to)
+        self.save_last_run_date(date_to)
+
+        self.print_statistics()
+        return html_file
+```
+
+- [ ] **Step 4: Run all tests to confirm they pass**
+
+```bash
+PYTHONPATH=. pytest tests/test_incremental_snapshot.py -v
+```
+
+Expected: all tests PASS.
+
+- [ ] **Step 5: Commit**
+
+```bash
+git add src/snapshot_downloader.py tests/test_incremental_snapshot.py
+git commit -m "feat: incremental snapshot fetch with JSON cache and state file"
+```
+
+---
+
+## Self-Review Checklist
+
+- Spec: all 4 I/O methods — covered in Task 1
+- Spec: fixed filename — covered in Task 2
+- Spec: incremental run logic (7 steps) — covered in Task 3
+- Spec: fetch failure leaves state unchanged — covered by `test_fetch_failure_does_not_update_state`
+- Spec: deduplication by `id` — covered by `test_deduplication_by_id`
+- Spec: `ConfigSnapshotDownloader` unchanged — no tasks touch it
+- Method names consistent across all tasks: `load_snapshot_cache`, `save_snapshot_cache`, `load_last_run_date`, `save_last_run_date`
+- `fetch_all_snapshots` call args order `(session, type_ids, fetch_from, date_to, max_pages)` matches existing signature at line 245
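
Note on Task 3: the fetch-window decision at the top of `download_snapshots` can be pulled out as a pure function, which makes the first-run and incremental branches easy to check without mocking the API. The sketch below is an assumption-laden illustration, not part of the plan; `resolve_fetch_window` is a hypothetical name, and `today` is passed in so the result is deterministic.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def resolve_fetch_window(
    last_run_date: Optional[str],
    date_from: Optional[str],
    today: date,
) -> Tuple[str, str]:
    """Return (fetch_from, date_to) as YYYY-MM-DD strings.

    Hypothetical helper mirroring the plan's logic:
    - date_to is always today;
    - a stored last-run date wins over the configured date_from;
    - with neither available, fall back to 365 days before today.
    """
    date_to = today.strftime("%Y-%m-%d")
    if last_run_date:
        return last_run_date, date_to  # incremental run
    if date_from is None:
        date_from = (today - timedelta(days=365)).strftime("%Y-%m-%d")
    return date_from, date_to  # first run
```

For example, with a stored last-run date of `2025-03-01` and a configured `date_from` of `2024-01-01`, the function returns a window starting at `2025-03-01`, which is the behaviour `test_subsequent_run_uses_last_run_date_as_fetch_from` asserts.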
@@ -0,0 +1,61 @@
+# Incremental Snapshot Downloader Design
+
+**Date:** 2026-05-15
+**Status:** Approved
+
+## Problem
+
+The snapshot downloader currently produces a new HTML file on every daily run, named `snapshots_{date_from}_to_{date_to}.html`. Because `date_to` changes each day, files accumulate with near-identical content. The user wants a single file updated in place each day.
+
+## Approach
+
+Incremental fetch with a JSON data cache and a state file. On each run, only new snapshots are fetched from the API; they are merged into a persistent cache of all snapshot objects; the single output HTML is re-rendered from the full cache.
+
+## Data Files
+
+All written to `output_dir`:
+
+| File | Purpose |
+|---|---|
+| `snapshots_cache.json` | Array of every snapshot object fetched so far |
+| `last_run.json` | `{"last_date_to": "YYYY-MM-DD"}` — upper bound of the last successful fetch |
+| `snapshots.html` | Single output file, fixed name, overwritten each run |
+
+## Run Logic
+
+1. Read `last_run.json` → use `last_date_to` as `date_from` for this run. Fall back to the `date_from` argument (from config) if no state file exists yet (first run).
+2. Load `snapshots_cache.json` if it exists → existing snapshot list. Empty list if absent.
+3. Fetch new snapshots from the API with `date_from` (from step 1) and `date_to = today`.
+4. Merge: append new snapshots to the cache list, deduplicating by snapshot `id`. Deduplication is needed because snapshots from the boundary date may be returned by the API again on subsequent runs.
+5. Save the updated `snapshots_cache.json`.
+6. Re-render `snapshots.html` from the full merged list (sorted newest-first, existing logic unchanged).
+7. Write `last_run.json` with today's date.
+
+**On fetch failure:** steps 5–7 are skipped. The cache and state file remain unchanged, so the next run retries from the same `date_from`.
+
+## Code Changes
+
+All changes are in `src/snapshot_downloader.py`. `ConfigSnapshotDownloader` is unchanged.
+
+### New methods on `SnapshotDownloader`
+
+- `load_snapshot_cache() -> List[Dict]` — reads `snapshots_cache.json`; returns `[]` if absent or malformed
+- `save_snapshot_cache(snapshots: List[Dict])` — writes `snapshots_cache.json`
+- `load_last_run_date() -> Optional[str]` — reads `last_run.json`; returns `None` if absent or malformed
+- `save_last_run_date(date: str)` — writes `last_run.json`
+
+### Modified methods
+
+**`download_snapshots(type_ids, date_from, date_to, max_pages)`**
+
+`date_from` becomes the "initial history start" fallback (used only when `last_run.json` does not exist). `date_to` is always set to today's date, regardless of what is passed. The method orchestrates the 7-step incremental flow above.
+
+**`generate_html_file(snapshots, date_from, date_to)`**
+
+Output filename changes from `snapshots_{date_from}_to_{date_to}.html` to the fixed name `snapshots.html`.
+
+## Out of Scope
+
+- Detecting and applying updates to existing snapshot objects (notes edits, etc.) — snapshots are treated as append-only
+- Pruning the cache (old entries are kept indefinitely)
+- Changes to `ConfigSnapshotDownloader`, the web server, or any other module
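
Note on step 4 of the run logic: the append-only merge can be sketched as a small standalone function. This is a minimal illustration under stated assumptions, not the code in `src/snapshot_downloader.py`; `merge_snapshots` is a hypothetical name. Unlike a single list comprehension over the fetched batch, this variant also drops duplicates that appear twice within the same batch, which is an assumption about the desired behaviour.

```python
from typing import Any, Dict, List


def merge_snapshots(
    existing: List[Dict[str, Any]],
    fetched: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Append fetched snapshots whose id is not already known.

    Cached entries are kept as-is (append-only); a fetched snapshot
    with a known id is ignored, even if its content has changed.
    """
    seen_ids = {snapshot.get("id") for snapshot in existing}
    merged = list(existing)  # copy so the caller's list is not mutated
    for snapshot in fetched:
        snapshot_id = snapshot.get("id")
        if snapshot_id not in seen_ids:
            merged.append(snapshot)
            seen_ids.add(snapshot_id)
    return merged
```

Because cached entries always win, an edited note on an already-cached snapshot is not picked up, which matches the first item listed under Out of Scope.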
@@ -190,7 +190,6 @@ class SnapshotDownloader:
            async with session.get(url, headers=headers, timeout=30) as response:
                response.raise_for_status()
                data = await response.json()
-                print(data.get("posts", []))
                # Log API response summary
                posts_count = len(data.get("posts", []))
                has_cursor = bool(data.get("cursor"))
@@ -539,6 +538,46 @@ class SnapshotDownloader:
            return None


+    def load_snapshot_cache(self) -> List[Dict[str, Any]]:
+        cache_file = self.output_dir / "snapshots_cache.json"
+        if not cache_file.exists():
+            return []
+        try:
+            with open(cache_file, "r", encoding="utf-8") as f:
+                data = json.load(f)
+            return data if isinstance(data, list) else []
+        except (json.JSONDecodeError, OSError):
+            self.logger.warning("Could not read snapshot cache; starting fresh")
+            return []
+
+    def save_snapshot_cache(self, snapshots: List[Dict[str, Any]]) -> None:
+        cache_file = self.output_dir / "snapshots_cache.json"
+        try:
+            with open(cache_file, "w", encoding="utf-8") as f:
+                json.dump(snapshots, f, indent=2, default=str)
+        except OSError as e:
+            self.logger.warning(f"Could not write snapshot cache: {e}")
+
+    def load_last_run_date(self) -> Optional[str]:
+        state_file = self.output_dir / "last_run.json"
+        if not state_file.exists():
+            return None
+        try:
+            with open(state_file, "r", encoding="utf-8") as f:
+                data = json.load(f)
+            return data.get("last_date_to")
+        except (json.JSONDecodeError, OSError):
+            self.logger.warning("Could not read last run date; will do full fetch")
+            return None
+
+    def save_last_run_date(self, date: str) -> None:
+        state_file = self.output_dir / "last_run.json"
+        try:
+            with open(state_file, "w", encoding="utf-8") as f:
+                json.dump({"last_date_to": date}, f, indent=2)
+        except OSError as e:
+            self.logger.warning(f"Could not write last run date: {e}")
+
    async def generate_html_file(
        self, snapshots: List[Dict[str, Any]], date_from: str, date_to: str
    ) -> Path:
@@ -547,8 +586,8 @@ class SnapshotDownloader:

        Args:
            snapshots: List of snapshot dictionaries
-            date_from: Start date
-            date_to: End date
+            date_from: Start date shown in the HTML page title (does not affect the filename)
+            date_to: End date shown in the HTML page title (does not affect the filename)

        Returns:
            Path to the generated HTML file
@@ -558,8 +597,8 @@ class SnapshotDownloader:
            snapshots, key=lambda x: x.get("startTime", ""), reverse=True
        )

-        # Generate filename
-        filename = f"snapshots_{date_from}_to_{date_to}.html"
+        # Fixed filename — overwritten on each run
+        filename = "snapshots.html"
        filepath = self.output_dir / filename

        # Generate HTML content
@@ -591,9 +630,6 @@ class SnapshotDownloader:
        async with aiohttp.ClientSession(
            connector=connector, timeout=timeout
        ) as session:
-            # Authenticate session for media downloads
-            await self.authenticate()
-
            for index, snapshot in enumerate(snapshots):
                snapshot_html = await self.format_snapshot_html(snapshot, session, index)
                snapshots_html += snapshot_html
@@ -981,60 +1017,62 @@ class SnapshotDownloader:
        max_pages: int = None,
    ) -> Path:
        """
-        Download all snapshots and generate HTML file.
+        Download new snapshots incrementally and regenerate snapshots.html.

-        Args:
-            type_ids: List of type IDs to filter by (default: [15])
-            date_from: Start date in YYYY-MM-DD format
-            date_to: End date in YYYY-MM-DD format
-            max_pages: Maximum number of pages to fetch
-
-        Returns:
-            Path to generated HTML file
+        date_from is used only on the first run (no last_run.json).
+        date_to is always today regardless of what is passed.
        """
-        # Set default dates if not provided
-        if date_from is None:
-            # Default to 1 year ago
-            date_from = (datetime.now() - timedelta(days=365)).strftime("%Y-%m-%d")
-        if date_to is None:
        date_to = datetime.now().strftime("%Y-%m-%d")

-        self.logger.info(
-            f"Starting snapshot download for period {date_from} to {date_to}"
-        )
+        # Resolve date_from to a concrete value (used for HTML page title)
+        if date_from is None:
+            date_from = (datetime.now() - timedelta(days=365)).strftime("%Y-%m-%d")
+
+        # Determine fetch window start
+        last_run_date = self.load_last_run_date()
+        if last_run_date:
+            fetch_from = last_run_date
+            self.logger.info(f"Incremental run: fetching from {fetch_from}")
+        else:
+            fetch_from = date_from
+            self.logger.info(f"First run: fetching all snapshots from {fetch_from}")
+
+        self.logger.info(f"Fetch window: {fetch_from} to {date_to}")
+
+        # Load accumulated snapshot data
+        existing_snapshots = self.load_snapshot_cache()
+        self.logger.info(f"Loaded {len(existing_snapshots)} snapshots from cache")

-        # Create aiohttp session
        connector = aiohttp.TCPConnector(limit=100, limit_per_host=30)
        timeout = aiohttp.ClientTimeout(total=30)

-        async with aiohttp.ClientSession(
-            connector=connector, timeout=timeout
-        ) as session:
-            try:
+        async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:
            # Authenticate if needed
            await self.authenticate()

-                # Fetch all snapshots
-                snapshots = await self.fetch_all_snapshots(
-                    session, type_ids, date_from, date_to, max_pages
+            # Fetch only new snapshots
+            new_snapshots = await self.fetch_all_snapshots(
+                session, type_ids, fetch_from, date_to, max_pages
            )

-                if not snapshots:
-                    self.logger.warning("No snapshots found for the specified period")
+        # Merge: deduplicate by id
+        existing_ids = {s.get("id") for s in existing_snapshots}
+        added = [s for s in new_snapshots if s.get("id") not in existing_ids]
+        merged = existing_snapshots + added
+        self.logger.info(f"Added {len(added)} new snapshots (total: {len(merged)})")
+
+        if not merged:
+            self.logger.warning("No snapshots found")
            return None

-                # Generate HTML file
-                html_file = await self.generate_html_file(snapshots, date_from, date_to)
+        # Persist updated cache and state
+        self.save_snapshot_cache(merged)
+        html_file = await self.generate_html_file(merged, date_from, date_to)
+        self.save_last_run_date(date_to)

-                # Print statistics
        self.print_statistics()
-
        return html_file

-            except Exception as e:
-                self.logger.error(f"Error during snapshot download: {e}")
-                raise
-
    def print_statistics(self):
        """Print download statistics."""
        print("\n" + "=" * 60)
@@ -0,0 +1,175 @@
+import asyncio
+import json
+import pytest
+from unittest.mock import patch, AsyncMock
+
+from src.snapshot_downloader import SnapshotDownloader
+
+
+def _downloader(tmp_path):
+    return SnapshotDownloader(output_dir=str(tmp_path), api_key="test-key")
+
+
+# --- load_snapshot_cache ---
+
+def test_load_snapshot_cache_missing(tmp_path):
+    assert _downloader(tmp_path).load_snapshot_cache() == []
+
+
+def test_load_snapshot_cache_returns_data(tmp_path):
+    d = _downloader(tmp_path)
+    snapshots = [{"id": "1", "notes": "hello"}]
+    (tmp_path / "snapshots_cache.json").write_text(json.dumps(snapshots))
+    assert d.load_snapshot_cache() == snapshots
+
+
+def test_load_snapshot_cache_malformed_returns_empty(tmp_path):
+    d = _downloader(tmp_path)
+    (tmp_path / "snapshots_cache.json").write_text("not json{{{")
+    assert d.load_snapshot_cache() == []
+
+
+def test_load_snapshot_cache_non_list_returns_empty(tmp_path):
+    d = _downloader(tmp_path)
+    (tmp_path / "snapshots_cache.json").write_text('{"key": "val"}')
+    assert d.load_snapshot_cache() == []
+
+
+# --- save_snapshot_cache ---
+
+def test_save_snapshot_cache_writes_json(tmp_path):
+    d = _downloader(tmp_path)
+    snapshots = [{"id": "1"}, {"id": "2"}]
+    d.save_snapshot_cache(snapshots)
+    data = json.loads((tmp_path / "snapshots_cache.json").read_text())
+    assert data == snapshots
+
+
+# --- load_last_run_date ---
+
+def test_load_last_run_date_missing(tmp_path):
+    assert _downloader(tmp_path).load_last_run_date() is None
+
+
+def test_load_last_run_date_returns_date(tmp_path):
+    d = _downloader(tmp_path)
+    (tmp_path / "last_run.json").write_text('{"last_date_to": "2025-01-01"}')
+    assert d.load_last_run_date() == "2025-01-01"
+
+
+def test_load_last_run_date_malformed_returns_none(tmp_path):
+    d = _downloader(tmp_path)
+    (tmp_path / "last_run.json").write_text("not json")
+    assert d.load_last_run_date() is None
+
+
+def test_load_last_run_date_missing_key_returns_none(tmp_path):
+    d = _downloader(tmp_path)
+    (tmp_path / "last_run.json").write_text('{"date": "2025-01-01"}')
+    assert d.load_last_run_date() is None
+
+
+# --- save_last_run_date ---
+
+def test_save_last_run_date_writes_json(tmp_path):
+    d = _downloader(tmp_path)
+    d.save_last_run_date("2025-06-01")
+    data = json.loads((tmp_path / "last_run.json").read_text())
+    assert data == {"last_date_to": "2025-06-01"}
+
+
+# --- generate_html_file fixed filename ---
+
+def test_generate_html_file_uses_fixed_filename(tmp_path):
+    d = _downloader(tmp_path)
+    with patch.object(d, "generate_html_template", new_callable=AsyncMock, return_value="<html></html>"):
+        result = asyncio.run(d.generate_html_file([], "2024-01-01", "2025-01-01"))
+    assert result.name == "snapshots.html"
+    assert (tmp_path / "snapshots.html").exists()
+
+
+# --- incremental download_snapshots ---
+
+def _run_download(d, **kwargs):
+    """Run download_snapshots with mocked API calls."""
+    new_snapshots = kwargs.pop("new_snapshots", [])
+    mock_fetch = AsyncMock(return_value=new_snapshots)
+    with patch.object(d, "authenticate", new_callable=AsyncMock):
+        with patch.object(d, "fetch_all_snapshots", mock_fetch):
+            with patch.object(d, "generate_html_file", new_callable=AsyncMock,
+                              return_value=d.output_dir / "snapshots.html"):
+                asyncio.run(d.download_snapshots(**kwargs))
+    return mock_fetch
+
+
+def test_first_run_saves_cache_and_state(tmp_path):
+    d = _downloader(tmp_path)
+    new_snapshots = [{"id": "abc", "startTime": "2025-01-15T10:00:00Z"}]
+    _run_download(d, date_from="2024-01-01", new_snapshots=new_snapshots)
+
+    assert d.load_snapshot_cache() == new_snapshots
+    assert d.load_last_run_date() is not None
+
+
+def test_subsequent_run_uses_last_run_date_as_fetch_from(tmp_path):
+    d = _downloader(tmp_path)
+    d.save_last_run_date("2025-03-01")
+    d.save_snapshot_cache([{"id": "old", "startTime": "2025-02-01T00:00:00Z"}])
+
+    new_snapshots = [{"id": "new", "startTime": "2025-03-15T00:00:00Z"}]
+    mock_fetch = _run_download(d, date_from="2024-01-01", new_snapshots=new_snapshots)
+
+    # Third positional arg to fetch_all_snapshots is date_from (after session, type_ids)
+    assert mock_fetch.call_args.args[2] == "2025-03-01"
+
+    ids = {s["id"] for s in d.load_snapshot_cache()}
+    assert ids == {"old", "new"}
+
+
+def test_deduplication_by_id(tmp_path):
+    d = _downloader(tmp_path)
+    d.save_last_run_date("2025-01-01")
+    d.save_snapshot_cache([{"id": "dup", "startTime": "2025-01-01T00:00:00Z"}])
+
+    # API returns the boundary snapshot again plus one new one
+    new_snapshots = [
+        {"id": "dup", "startTime": "2025-01-01T00:00:00Z"},
+        {"id": "fresh", "startTime": "2025-01-02T00:00:00Z"},
+    ]
+    _run_download(d, date_from="2024-01-01", new_snapshots=new_snapshots)
+
+    cache = d.load_snapshot_cache()
+    ids = [s["id"] for s in cache]
+    assert ids.count("dup") == 1
+    assert "fresh" in ids
+
+
+def test_fetch_failure_does_not_update_state(tmp_path):
+    d = _downloader(tmp_path)
+    d.save_last_run_date("2025-01-01")
+    d.save_snapshot_cache([{"id": "existing"}])
+
+    with patch.object(d, "authenticate", new_callable=AsyncMock):
+        with patch.object(d, "fetch_all_snapshots", new_callable=AsyncMock,
+                          side_effect=Exception("network error")):
+            with pytest.raises(Exception, match="network error"):
+                asyncio.run(d.download_snapshots(date_from="2024-01-01"))
+
+    assert d.load_last_run_date() == "2025-01-01"
+    assert d.load_snapshot_cache() == [{"id": "existing"}]
+
+
+def test_html_generation_failure_does_not_update_state_file(tmp_path):
+    d = _downloader(tmp_path)
+    d.save_last_run_date("2025-01-01")
+    d.save_snapshot_cache([{"id": "existing"}])
+    new_snapshots = [{"id": "new", "startTime": "2025-02-01T00:00:00Z"}]
+    with patch.object(d, "authenticate", new_callable=AsyncMock):
+        with patch.object(d, "fetch_all_snapshots", new_callable=AsyncMock,
+                          return_value=new_snapshots):
+            with patch.object(d, "generate_html_file", new_callable=AsyncMock,
+                              side_effect=OSError("disk full")):
+                with pytest.raises(OSError, match="disk full"):
+                    asyncio.run(d.download_snapshots(date_from="2024-01-01"))
+    # Cache was updated with new data, but state file was NOT advanced
+    assert d.load_last_run_date() == "2025-01-01"
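
The incremental tests above imply a merge step that combines cached snapshots with newly fetched ones and keeps a single entry per `id`. A minimal sketch of such a merge follows, assuming snapshots are plain dicts with an `id` key. `merge_snapshots` is a hypothetical helper name, not necessarily how `download_snapshots` does it, and letting the freshly fetched copy win on a duplicate is also an assumption.

```python
from typing import Any


def merge_snapshots(
    cached: list[dict[str, Any]],
    fetched: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Merge two snapshot lists, keeping one entry per snapshot id.

    Hypothetical helper: a newly fetched snapshot replaces a cached one
    with the same id (assumption), and first-seen order is preserved.
    """
    by_id: dict[Any, dict[str, Any]] = {}
    for snapshot in [*cached, *fetched]:
        # Re-assigning an existing key keeps its original position in the dict.
        by_id[snapshot["id"]] = snapshot
    return list(by_id.values())
```

With the data from `test_deduplication_by_id`, the boundary snapshot `dup` would appear once in the merged list, followed by `fresh`.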
Author	SHA1	Message	Date
Tudor Sitaru	5f9014a075	adding claude md initial file	2026-05-16 20:34:42 +01:00
Tudor Sitaru	aa656c3008	fix: remove double authenticate, remove debug print, fix display date	2026-05-16 20:29:22 +01:00
Tudor Sitaru	778ae47ebe	test: verify state file not updated when HTML generation fails Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 20:26:10 +01:00
Tudor Sitaru	5524373e78	feat: incremental snapshot fetch with JSON cache and state file Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 20:22:54 +01:00
Tudor Sitaru	68d5996165	feat: write snapshots to fixed filename snapshots.html Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 20:17:30 +01:00
Tudor Sitaru	b13f38c821	fix: handle OSError in save methods, add missing test, consistent logging - Wrap save_snapshot_cache and save_last_run_date in try/except OSError, logging a warning instead of propagating the exception - Add indent=2 to save_last_run_date for consistency - Add warning log to load_last_run_date on read failure (matching load_snapshot_cache pattern) - Add test_load_last_run_date_missing_key_returns_none covering valid JSON with absent key - Remove unused asyncio, AsyncMock, and patch imports from test file Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-16 20:15:24 +01:00
Tudor Sitaru	d77226413d	feat: add snapshot cache and state file I/O methods Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 16:09:50 +01:00
Tudor Sitaru	8284b2593e	Add implementation plan for incremental snapshot downloader	2026-05-15 15:27:51 +01:00
Tudor Sitaru	36bb97c0d1	Add design spec for incremental snapshot downloader	2026-05-15 15:24:39 +01:00
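
The state file behaviour described in commit b13f38c821 and exercised by the `load_last_run_date` / `save_last_run_date` tests can be sketched roughly as below. This is an illustration under stated assumptions, not the code from `src/`: the functions are written as free functions taking an `output_dir` rather than as methods, the log messages are invented, and only the file name `last_run.json`, the key `last_date_to`, `indent=2`, and the "log a warning instead of raising" behaviour come from the tests and commit messages.

```python
import json
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

STATE_FILENAME = "last_run.json"  # file name used in the tests
STATE_KEY = "last_date_to"        # key used in the tests


def load_last_run_date(output_dir: Path) -> Optional[str]:
    """Return the stored date, or None if the file is missing, unreadable, or malformed."""
    state_file = output_dir / STATE_FILENAME
    if not state_file.exists():
        return None
    try:
        data = json.loads(state_file.read_text())
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError is a subclass of ValueError.
        logger.warning("Could not read %s: %s", state_file, exc)
        return None
    if not isinstance(data, dict):
        return None
    # Valid JSON without the expected key also yields None.
    return data.get(STATE_KEY)


def save_last_run_date(output_dir: Path, date_to: str) -> None:
    """Write the state file; log a warning instead of raising on OSError."""
    state_file = output_dir / STATE_FILENAME
    try:
        state_file.write_text(json.dumps({STATE_KEY: date_to}, indent=2))
    except OSError as exc:
        logger.warning("Could not write %s: %s", state_file, exc)
```

Swallowing `OSError` on save means a failed write never aborts a run; the cost is that the next run may refetch a range it has already seen, which the deduplication by `id` is there to absorb.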