parentzone_downloader/ASSET_TRACKING_README.md
Tudor Sitaru ddde67ca62 first commit
2025-10-07 14:52:04 +01:00

Asset Tracking System

This document describes the asset tracking system implemented for the ParentZone Downloader. The system identifies which assets are new or modified and downloads only those, avoiding unnecessary re-downloads.

Overview

The asset tracking system consists of two main components:

  1. AssetTracker (asset_tracker.py) - Manages local metadata and identifies new/modified assets
  2. ImageDownloader Integration - Enhanced downloader with asset tracking capabilities

Features

🎯 Smart Asset Detection

  • New Assets: Automatically detects assets that haven't been downloaded before
  • Modified Assets: Identifies assets that have changed since the last download (based on the timestamp, size, checksum, and ETag fields reported by the API)
  • Unchanged Assets: Efficiently skips assets that are already up-to-date locally

📊 Comprehensive Tracking

  • Metadata Storage: Stores asset metadata in JSON format for persistence
  • File Integrity: Tracks file sizes, modification times, and content hashes
  • Download History: Maintains records of successful and failed downloads

🧹 Maintenance Features

  • Cleanup: Removes metadata for files that no longer exist on disk
  • Statistics: Provides detailed statistics about tracked assets
  • Validation: Ensures consistency between metadata and actual files

Quick Start

Basic Usage with Asset Tracking

# Download only new/modified assets (default behavior)
python3 image_downloader.py \
    --api-url "https://api.parentzone.me" \
    --list-endpoint "/v1/media/list" \
    --download-endpoint "/v1/media" \
    --output-dir "./downloaded_images" \
    --email "your-email@example.com" \
    --password "your-password"

Advanced Options

# Disable asset tracking (download all assets)
python3 image_downloader.py [options] --no-tracking

# Force re-download of all assets
python3 image_downloader.py [options] --force-redownload

# Show asset tracking statistics
python3 image_downloader.py [options] --show-stats

# Clean up metadata for missing files
python3 image_downloader.py [options] --cleanup

Asset Tracker API

Basic Usage

from asset_tracker import AssetTracker

# Initialize tracker
tracker = AssetTracker(storage_dir="downloaded_images")

# Get new assets that need downloading
api_assets = [...]  # Assets from API response
new_assets = tracker.get_new_assets(api_assets)

# Mark an asset as downloaded
tracker.mark_asset_downloaded(asset, filepath, success=True)

# Get statistics
stats = tracker.get_stats()

Key Methods

get_new_assets(api_assets: List[Dict]) -> List[Dict]

Identifies new or modified assets that need to be downloaded.

Parameters:

  • api_assets: List of asset dictionaries from API response

Returns:

  • List of assets that need to be downloaded

Example:

# API returns 100 assets, but only 5 are new/modified
# (run inside an async function; fetch_assets_from_api is a placeholder for your own API call)
api_assets = await fetch_assets_from_api()
new_assets = tracker.get_new_assets(api_assets)
print(f"Need to download {len(new_assets)} out of {len(api_assets)} assets")
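The filtering that get_new_assets performs can be pictured roughly as follows. This is a minimal sketch rather than the actual implementation: select_new_assets, key_of, and hash_of are hypothetical names, the metadata dict only mirrors the record layout described under Metadata Storage, and retrying previously failed downloads is an assumption.

```python
from typing import Any, Callable, Dict, List

Asset = Dict[str, Any]


def select_new_assets(
    api_assets: List[Asset],
    metadata: Dict[str, Dict[str, Any]],
    key_of: Callable[[Asset], str],
    hash_of: Callable[[Asset], str],
) -> List[Asset]:
    """Return the assets that are untracked, previously failed, or changed."""
    to_download = []
    for asset in api_assets:
        record = metadata.get(key_of(asset))
        if record is None:
            to_download.append(asset)  # never seen before
        elif not record.get("success", False):
            to_download.append(asset)  # assumption: failed downloads are retried
        elif record.get("content_hash") != hash_of(asset):
            to_download.append(asset)  # API metadata changed since last download
    return to_download


# Hypothetical data: one unchanged asset, one modified, one new.
tracked = {
    "a1": {"success": True, "content_hash": "v1"},
    "a2": {"success": True, "content_hash": "v1"},
}
incoming = [
    {"id": "a1", "updated": "v1"},
    {"id": "a2", "updated": "v2"},
    {"id": "a3", "updated": "v1"},
]
pending = select_new_assets(
    incoming, tracked, key_of=lambda a: a["id"], hash_of=lambda a: a["updated"]
)
print([a["id"] for a in pending])
```

Passing the key and hash functions in as arguments keeps the sketch independent of how the real tracker derives them.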

mark_asset_downloaded(asset: Dict, filepath: Path, success: bool)

Records that an asset has been downloaded (or attempted).

Parameters:

  • asset: Asset dictionary from API
  • filepath: Local path where asset was saved
  • success: Whether download was successful

cleanup_missing_files()

Removes metadata entries for files that no longer exist on disk.
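A cleanup pass of this kind might look like the sketch below. It is illustrative only: drop_missing_entries is a hypothetical helper, the records are assumed to carry a filepath field as in the metadata file, and the exists parameter is an addition made here so the logic can be exercised without touching the disk.

```python
import os
from typing import Any, Callable, Dict, List


def drop_missing_entries(
    metadata: Dict[str, Dict[str, Any]],
    exists: Callable[[str], bool] = os.path.isfile,
) -> List[str]:
    """Delete records whose file is gone and return the removed asset keys."""
    # Collect keys first so the dict is not modified while iterating over it.
    missing = [
        key for key, record in metadata.items()
        if not exists(record.get("filepath", ""))
    ]
    for key in missing:
        del metadata[key]
    return missing
```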

get_stats() -> Dict

Returns comprehensive statistics about tracked assets.

Returns:

{
    'total_tracked_assets': 150,
    'successful_downloads': 145,
    'failed_downloads': 5,
    'existing_files': 140,
    'missing_files': 10,
    'total_size_bytes': 524288000,
    'total_size_mb': 500.0
}
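As a rough illustration of how such a dictionary could be derived from the stored records, consider the sketch below. compute_stats is a hypothetical function; counting sizes only for files that still exist and treating one MB as 1024 × 1024 bytes are assumptions (the latter matches the sample above, where 524288000 bytes is reported as 500.0).

```python
import os
from typing import Any, Callable, Dict


def compute_stats(
    metadata: Dict[str, Dict[str, Any]],
    exists: Callable[[str], bool] = os.path.isfile,
) -> Dict[str, Any]:
    """Summarise tracked records in the same shape as the get_stats() sample."""
    total = len(metadata)
    successful = sum(1 for r in metadata.values() if r.get("success"))
    existing = [r for r in metadata.values() if exists(r.get("filepath", ""))]
    # Assumption: only files still present on disk count toward the total size.
    size_bytes = sum(int(r.get("file_size") or 0) for r in existing)
    return {
        "total_tracked_assets": total,
        "successful_downloads": successful,
        "failed_downloads": total - successful,
        "existing_files": len(existing),
        "missing_files": total - len(existing),
        "total_size_bytes": size_bytes,
        "total_size_mb": round(size_bytes / (1024 * 1024), 2),
    }
```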

Metadata Storage

File Structure

Asset metadata is stored in {output_dir}/asset_metadata.json:

{
  "asset_001": {
    "asset_id": "asset_001",
    "filename": "family_photo.jpg",
    "filepath": "/path/to/downloaded_images/family_photo.jpg",
    "download_date": "2024-01-15T10:30:00",
    "success": true,
    "content_hash": "d41d8cd98f00b204e9800998ecf8427e",
    "file_size": 1024000,
    "file_modified": "2024-01-15T10:30:00",
    "api_data": {
      "id": "asset_001",
      "name": "family_photo.jpg",
      "updated": "2024-01-01T10:00:00Z",
      "size": 1024000,
      "mimeType": "image/jpeg"
    }
  }
}
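Because the file is plain JSON, it can be inspected with the standard library alone. The sketch below checks a metadata document for structural problems; find_metadata_problems and the REQUIRED_FIELDS list are hypothetical, and which fields are actually mandatory is an assumption based on the sample above.

```python
import json
from typing import List

# Assumption: these fields are expected on every record.
REQUIRED_FIELDS = (
    "asset_id", "filename", "filepath", "download_date", "success", "content_hash",
)


def find_metadata_problems(metadata_text: str) -> List[str]:
    """Return human-readable problems found in a metadata JSON document."""
    problems = []
    metadata = json.loads(metadata_text)
    for key, record in metadata.items():
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"{key}: missing '{field}'")
        # Each record is stored under its own asset ID in the sample layout.
        if record.get("asset_id", key) != key:
            problems.append(f"{key}: asset_id does not match its key")
    return problems
```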

Asset Identification

Assets are identified using the following priority:

  1. id field
  2. assetId field
  3. uuid field
  4. MD5 hash of asset data (fallback)
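The priority order above can be sketched as follows. asset_key is a hypothetical name, and hashing a sorted JSON dump for the fallback is an assumption about how "asset data" is serialized; the real implementation may differ.

```python
import hashlib
import json
from typing import Any, Dict


def asset_key(asset: Dict[str, Any]) -> str:
    """Return a stable key using the id -> assetId -> uuid -> MD5 fallback order."""
    for field in ("id", "assetId", "uuid"):
        value = asset.get(field)
        if value is not None and value != "":
            return str(value)
    # Fallback: hash a canonical dump so that dict key order does not matter.
    canonical = json.dumps(asset, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

Sorting the keys before hashing means two API responses with the same fields in a different order map to the same fallback key.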

Change Detection

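Assets are considered modified if their content hash differs from the one stored at the last download. The hash is computed from the following fields of the API asset data, where present: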

  • updated timestamp
  • modified timestamp
  • lastModified timestamp
  • size field
  • checksum field
  • etag field
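A hash over just these fields could be computed along the lines below. This is a sketch under assumptions: change_hash and CHANGE_FIELDS are hypothetical names, and the use of MD5 over a sorted JSON dump is a guess consistent with the 32-character hash in the sample metadata, not a confirmed detail.

```python
import hashlib
import json
from typing import Any, Dict

# The fields listed above; anything else in the asset is ignored.
CHANGE_FIELDS = ("updated", "modified", "lastModified", "size", "checksum", "etag")


def change_hash(asset: Dict[str, Any]) -> str:
    """Hash only the change-relevant fields that are present on the asset."""
    relevant = {field: asset[field] for field in CHANGE_FIELDS if field in asset}
    canonical = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

With this approach, renaming an asset does not trigger a re-download, while a new size or timestamp does.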

Integration with ImageDownloader

Automatic Integration

When asset tracking is enabled (default), the ImageDownloader automatically:

  1. Initializes Tracker: Creates an AssetTracker instance
  2. Filters Assets: Only downloads new/modified assets
  3. Records Downloads: Marks successful/failed downloads in metadata
  4. Provides Feedback: Shows statistics about skipped vs downloaded assets
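The four steps can be sketched end to end with stand-ins for the real classes. Everything here is hypothetical: InMemoryTracker mimics only the two AssetTracker methods it needs, fake_download replaces the HTTP request, and sync_assets is not a method of ImageDownloader.

```python
import asyncio
from typing import Any, Dict, List

Asset = Dict[str, Any]


class InMemoryTracker:
    """Hypothetical stand-in for AssetTracker that keeps records in a dict."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, Any]] = {}

    def _needs_download(self, asset: Asset) -> bool:
        record = self.records.get(asset["id"])
        return (
            record is None
            or not record["success"]
            or record["updated"] != asset["updated"]
        )

    def get_new_assets(self, api_assets: List[Asset]) -> List[Asset]:
        return [a for a in api_assets if self._needs_download(a)]

    def mark_asset_downloaded(self, asset: Asset, filepath: str, success: bool) -> None:
        self.records[asset["id"]] = {
            "updated": asset["updated"], "filepath": filepath, "success": success,
        }


async def fake_download(asset: Asset) -> str:
    await asyncio.sleep(0)  # placeholder for the real HTTP request
    return f"./images/{asset['name']}"


async def sync_assets(tracker: InMemoryTracker, api_assets: List[Asset]) -> int:
    pending = tracker.get_new_assets(api_assets)  # step 2: filter
    for asset in pending:
        try:
            filepath = await fake_download(asset)
            tracker.mark_asset_downloaded(asset, filepath, success=True)  # step 3
        except OSError:
            tracker.mark_asset_downloaded(asset, "", success=False)
    skipped = len(api_assets) - len(pending)
    print(f"Attempted {len(pending)}, skipped {skipped}")  # step 4: feedback
    return len(pending)


async def main() -> List[int]:
    tracker = InMemoryTracker()  # step 1: initialize the tracker
    assets = [
        {"id": "a1", "name": "one.jpg", "updated": "2024-01-01"},
        {"id": "a2", "name": "two.jpg", "updated": "2024-01-01"},
    ]
    first = await sync_assets(tracker, assets)
    assets[1]["updated"] = "2024-02-01"  # simulate a modified asset
    second = await sync_assets(tracker, assets)
    return [first, second]


run_counts = asyncio.run(main())
```

The first pass attempts both assets; the second attempts only the one whose updated value changed.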

Example Integration

import asyncio

from image_downloader import ImageDownloader


async def main():
    # Asset tracking enabled by default
    downloader = ImageDownloader(
        api_url="https://api.parentzone.me",
        list_endpoint="/v1/media/list",
        download_endpoint="/v1/media",
        output_dir="./images",
        email="user@example.com",
        password="password",
        track_assets=True,  # Default: True
    )

    # First run: downloads all assets
    await downloader.download_all_assets()

    # Second run: skips unchanged assets, downloads only new/modified ones
    await downloader.download_all_assets()


# download_all_assets is awaited, so it must run inside an event loop
asyncio.run(main())

Testing

Unit Tests

# Run comprehensive asset tracking tests
python3 test_asset_tracking.py

# Output shows:
# ✅ Basic tracking test passed!
# ✅ Modified asset detection test passed!
# ✅ Cleanup functionality test passed!
# ✅ Integration test completed!

Live Demo

# Demonstrate asset tracking with real API
python3 demo_asset_tracking.py

# Shows:
# - Authentication process
# - Current asset status
# - First download run (downloads new assets)
# - Second run (skips all assets)
# - Final statistics

Performance Benefits

Network Efficiency

  • Fewer Download Requests: Only assets that are new or have changed are requested from the download endpoint
  • Bandwidth Savings: Skips unchanged assets entirely
  • Faster Sync: Subsequent runs complete much faster

Storage Efficiency

  • No Duplicates: Prevents downloading the same asset multiple times
  • Smart Cleanup: Removes metadata for deleted files
  • Size Tracking: Monitors total storage usage

Example Performance Impact

First Run:  150 assets → Downloaded 150 (100%)
Second Run: 150 assets → Downloaded 0 (0%) - All up to date!
Third Run:  155 assets → Downloaded 5 (3.2%) - Only new ones

Troubleshooting

Common Issues

"No existing metadata file found"

This is normal for first-time usage. The system will create the metadata file automatically.

"File missing, removing from metadata"

The cleanup process found files that were deleted outside the application. This is normal maintenance.

Asset tracking not working

Ensure AssetTracker is properly imported and asset tracking is enabled:

# Check if tracking is enabled
if downloader.asset_tracker:
    print("Asset tracking is enabled")
else:
    print("Asset tracking is disabled")

Manual Maintenance

Reset All Tracking

# Remove metadata file to start fresh
rm downloaded_images/asset_metadata.json

Clean Up Missing Files

python3 image_downloader.py --cleanup --output-dir "./downloaded_images"

View Statistics

python3 image_downloader.py --show-stats --output-dir "./downloaded_images"

Configuration

Environment Variables

# Disable asset tracking globally
export DISABLE_ASSET_TRACKING=1

# Set custom metadata filename
export ASSET_METADATA_FILE="my_assets.json"
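How the downloader interprets these variables is not specified here. The sketch below shows one plausible way to read them: read_tracking_config is a hypothetical function, treating "1", "true", and "yes" as truthy is an assumption, and the default filename is taken from the Metadata Storage section.

```python
import os
from typing import Mapping, Tuple


def read_tracking_config(env: Mapping[str, str] = os.environ) -> Tuple[bool, str]:
    """Return (tracking_enabled, metadata_filename) from environment variables."""
    # Assumption: any of these values disables tracking; anything else leaves it on.
    raw = env.get("DISABLE_ASSET_TRACKING", "").strip().lower()
    disabled = raw in {"1", "true", "yes"}
    metadata_file = env.get("ASSET_METADATA_FILE", "asset_metadata.json")
    return (not disabled, metadata_file)
```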

Programmatic Configuration

# Custom metadata file location
tracker = AssetTracker(
    storage_dir="./images",
    metadata_file="custom_metadata.json"
)

# Disable tracking for specific downloader
downloader = ImageDownloader(
    # ... other params ...
    track_assets=False
)

Future Enhancements

Planned Features

  • Parallel Metadata Updates: Concurrent metadata operations
  • Cloud Sync: Sync metadata across multiple devices
  • Asset Versioning: Track multiple versions of the same asset
  • Batch Operations: Bulk metadata operations for large datasets
  • Web Interface: Browser-based asset management

Extensibility

The asset tracking system is designed to be extensible:

# Custom asset identification
class CustomAssetTracker(AssetTracker):
    def _get_asset_key(self, asset):
        # Custom logic for asset identification
        return f"{asset.get('category')}_{asset.get('id')}"
    
    def _get_asset_hash(self, asset):
        # Custom logic for change detection
        return super()._get_asset_hash(asset)

API Reference

AssetTracker Class

| Method | Description | Parameters | Returns |
| --- | --- | --- | --- |
| __init__ | Initialize tracker | storage_dir, metadata_file | None |
| get_new_assets | Find new/modified assets | api_assets: List[Dict] | List[Dict] |
| mark_asset_downloaded | Record download | asset, filepath, success | None |
| is_asset_downloaded | Check if downloaded | asset: Dict | bool |
| is_asset_modified | Check if modified | asset: Dict | bool |
| cleanup_missing_files | Remove stale metadata | None | None |
| get_stats | Get statistics | None | Dict |
| print_stats | Print formatted stats | None | None |

ImageDownloader Integration

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| track_assets | bool | True | Enable asset tracking |

| Method | Description | Parameters |
| --- | --- | --- |
| download_all_assets | Download assets | force_redownload: bool = False |

Command Line Options

| Option | Description |
| --- | --- |
| --no-tracking | Disable asset tracking |
| --force-redownload | Download all assets regardless of tracking |
| --show-stats | Display asset statistics |
| --cleanup | Clean up missing file metadata |

Contributing

To contribute to the asset tracking system:

  1. Test Changes: Run python3 test_asset_tracking.py
  2. Update Documentation: Modify this README as needed
  3. Follow Patterns: Use existing code patterns and error handling
  4. Add Tests: Include tests for new functionality

License

This asset tracking system is part of the ParentZone Downloader project.