parentzone_downloader/ASSET_TRACKING_README.md


2025-10-07 14:52:04 +01:00
# Asset Tracking System
This document describes the asset tracking system implemented for the ParentZone Downloader, which intelligently identifies and downloads only new or modified assets, avoiding unnecessary re-downloads.
## Overview
The asset tracking system consists of two main components:
1. **AssetTracker** (`asset_tracker.py`) - Manages local metadata and identifies new/modified assets
2. **ImageDownloader Integration** - Enhanced downloader with asset tracking capabilities
## Features
### 🎯 Smart Asset Detection
- **New Assets**: Automatically detects assets that haven't been downloaded before
- **Modified Assets**: Identifies assets that have changed since last download (based on timestamp, size, etc.)
- **Unchanged Assets**: Efficiently skips assets that are already up-to-date locally
### 📊 Comprehensive Tracking
- **Metadata Storage**: Stores asset metadata in JSON format for persistence
- **File Integrity**: Tracks file sizes, modification times, and content hashes
- **Download History**: Maintains records of successful and failed downloads
### 🧹 Maintenance Features
- **Cleanup**: Removes metadata for files that no longer exist on disk
- **Statistics**: Provides detailed statistics about tracked assets
- **Validation**: Ensures consistency between metadata and actual files
## Quick Start
### Basic Usage with Asset Tracking
```bash
# Download only new/modified assets (default behavior)
python3 image_downloader.py \
  --api-url "https://api.parentzone.me" \
  --list-endpoint "/v1/media/list" \
  --download-endpoint "/v1/media" \
  --output-dir "./downloaded_images" \
  --email "your-email@example.com" \
  --password "your-password"
```
### Advanced Options
```bash
# Disable asset tracking (download all assets)
python3 image_downloader.py [options] --no-tracking
# Force re-download of all assets
python3 image_downloader.py [options] --force-redownload
# Show asset tracking statistics
python3 image_downloader.py [options] --show-stats
# Clean up metadata for missing files
python3 image_downloader.py [options] --cleanup
```
## Asset Tracker API
### Basic Usage
```python
from asset_tracker import AssetTracker
# Initialize tracker
tracker = AssetTracker(storage_dir="downloaded_images")
# Get new assets that need downloading
api_assets = [...] # Assets from API response
new_assets = tracker.get_new_assets(api_assets)
# Mark an asset as downloaded
tracker.mark_asset_downloaded(asset, filepath, success=True)
# Get statistics
stats = tracker.get_stats()
```
### Key Methods
#### `get_new_assets(api_assets: List[Dict]) -> List[Dict]`
Identifies new or modified assets that need to be downloaded.
**Parameters:**
- `api_assets`: List of asset dictionaries from API response
**Returns:**
- List of assets that need to be downloaded
**Example:**
```python
# API returns 100 assets, but only 5 are new/modified
# (fetch_assets_from_api is a placeholder; run this inside an async function)
api_assets = await fetch_assets_from_api()
new_assets = tracker.get_new_assets(api_assets)
print(f"Need to download {len(new_assets)} out of {len(api_assets)} assets")
```
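The filtering step can be pictured roughly as follows. This is a minimal sketch, not the code in `asset_tracker.py`: the function name `filter_new_assets` and the `key_of`/`hash_of` callables are hypothetical, and retrying assets whose last download failed is an assumption. The `success` and `content_hash` keys follow the metadata layout described under Metadata Storage.

```python
from typing import Any, Callable, Dict, List

Asset = Dict[str, Any]


def filter_new_assets(
    api_assets: List[Asset],
    metadata: Dict[str, Dict[str, Any]],
    key_of: Callable[[Asset], str],
    hash_of: Callable[[Asset], str],
) -> List[Asset]:
    """Return the assets that still need downloading (illustrative only)."""
    pending = []
    for asset in api_assets:
        record = metadata.get(key_of(asset))
        if record is None:
            pending.append(asset)  # never seen before
        elif not record.get("success", False):
            pending.append(asset)  # assumption: failed downloads are retried
        elif record.get("content_hash") != hash_of(asset):
            pending.append(asset)  # change-detection hash differs
    return pending


# Usage with toy key/hash functions (hypothetical data)
metadata = {"a": {"success": True, "content_hash": "2024-01-01"}}
api_assets = [
    {"id": "a", "updated": "2024-01-01"},  # unchanged, skipped
    {"id": "b", "updated": "2024-01-02"},  # new
]
pending = filter_new_assets(
    api_assets, metadata, key_of=lambda a: a["id"], hash_of=lambda a: a["updated"]
)
print([a["id"] for a in pending])  # prints ['b']
```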
#### `mark_asset_downloaded(asset: Dict, filepath: Path, success: bool)`
Records that an asset has been downloaded (or attempted).
**Parameters:**
- `asset`: Asset dictionary from API
- `filepath`: Local path where asset was saved
- `success`: Whether download was successful
#### `cleanup_missing_files()`
Removes metadata entries for files that no longer exist on disk.
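As a rough illustration of what such a cleanup involves, the sketch below prunes entries whose recorded `filepath` no longer points to a file. The function name `prune_missing` is hypothetical and the in-memory dictionary stands in for the loaded metadata; the real method may differ, for example in how it logs or persists the result.

```python
from pathlib import Path
from typing import Any, Dict, List


def prune_missing(metadata: Dict[str, Dict[str, Any]]) -> List[str]:
    """Delete entries whose file is gone and return the removed keys."""
    removed = [
        key
        for key, record in metadata.items()
        # An entry without a usable path is treated as missing (assumption).
        if not record.get("filepath") or not Path(record["filepath"]).is_file()
    ]
    for key in removed:
        del metadata[key]
    return removed
```

Collecting the keys first and deleting afterwards avoids mutating the dictionary while iterating over it.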
#### `get_stats() -> Dict`
Returns comprehensive statistics about tracked assets.
**Returns:**
```python
{
    'total_tracked_assets': 150,
    'successful_downloads': 145,
    'failed_downloads': 5,
    'existing_files': 140,
    'missing_files': 10,
    'total_size_bytes': 524288000,
    'total_size_mb': 500.0
}
```
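To make the relationship between these fields concrete, here is a hedged sketch of how they could be derived from the metadata. The function name `summarize` is hypothetical, and measuring sizes from the files on disk (rather than from the stored `file_size`) is an assumption. Dividing by 1024 × 1024 is consistent with the sample output above, where 524288000 bytes corresponds to 500.0 MB.

```python
from pathlib import Path
from typing import Any, Dict


def summarize(metadata: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Compute summary statistics from tracked-asset records (illustrative)."""
    records = list(metadata.values())
    successful = sum(1 for r in records if r.get("success"))
    # Sizes of the files that are still present on disk.
    sizes = [
        Path(r["filepath"]).stat().st_size
        for r in records
        if r.get("filepath") and Path(r["filepath"]).is_file()
    ]
    total_bytes = sum(sizes)
    return {
        "total_tracked_assets": len(records),
        "successful_downloads": successful,
        "failed_downloads": len(records) - successful,
        "existing_files": len(sizes),
        "missing_files": len(records) - len(sizes),
        "total_size_bytes": total_bytes,
        "total_size_mb": round(total_bytes / (1024 * 1024), 2),
    }
```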
## Metadata Storage
### File Structure
Asset metadata is stored in `{output_dir}/asset_metadata.json`:
```json
{
  "asset_001": {
    "asset_id": "asset_001",
    "filename": "family_photo.jpg",
    "filepath": "/path/to/downloaded_images/family_photo.jpg",
    "download_date": "2024-01-15T10:30:00",
    "success": true,
    "content_hash": "d41d8cd98f00b204e9800998ecf8427e",
    "file_size": 1024000,
    "file_modified": "2024-01-15T10:30:00",
    "api_data": {
      "id": "asset_001",
      "name": "family_photo.jpg",
      "updated": "2024-01-01T10:00:00Z",
      "size": 1024000,
      "mimeType": "image/jpeg"
    }
  }
}
```
### Asset Identification
Assets are identified using the following priority:
1. `id` field
2. `assetId` field
3. `uuid` field
4. MD5 hash of asset data (fallback)
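The priority order above can be sketched as a small function. This is an illustration only: the name `resolve_asset_key` is hypothetical, and serializing the asset as JSON with sorted keys before hashing is an assumption about how the MD5 fallback might be made stable.

```python
import hashlib
import json
from typing import Any, Dict


def resolve_asset_key(asset: Dict[str, Any]) -> str:
    """Return a stable key for an asset (hypothetical helper)."""
    # Try the identifier fields in the documented priority order.
    for field in ("id", "assetId", "uuid"):
        value = asset.get(field)
        if value not in (None, ""):
            return str(value)
    # Fallback: MD5 of the asset data. Sorting keys is an assumption made
    # here so that dictionary ordering does not change the resulting key.
    serialized = json.dumps(asset, sort_keys=True, default=str)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()
```

With this approach, an asset that carries both `id` and `uuid` is keyed by `id`, and an asset with none of the three fields gets a 32-character hexadecimal key.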
### Change Detection
Assets are considered modified if their content hash changes. The hash is derived from metadata reported by the API rather than from the downloaded file's bytes, and is based on:
- `updated` timestamp
- `modified` timestamp
- `lastModified` timestamp
- `size` field
- `checksum` field
- `etag` field
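A minimal sketch of this kind of change detection is shown below. The names `change_hash` and `is_modified` are hypothetical, and the use of MD5 over a sorted JSON serialization, as well as skipping fields that are absent, are assumptions rather than a description of the actual implementation.

```python
import hashlib
import json
from typing import Any, Dict

# Fields listed in this README as inputs to the change-detection hash.
CHANGE_FIELDS = ("updated", "modified", "lastModified", "size", "checksum", "etag")


def change_hash(asset: Dict[str, Any]) -> str:
    """Hash only the change-relevant fields of an asset (hypothetical helper)."""
    relevant = {field: asset[field] for field in CHANGE_FIELDS if field in asset}
    serialized = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


def is_modified(asset: Dict[str, Any], stored_hash: str) -> bool:
    """An asset counts as modified when its current hash differs from the stored one."""
    return change_hash(asset) != stored_hash
```

Because only the listed fields feed the hash, a change to an unrelated field such as `name` would not trigger a re-download in this sketch, while a new `size` or `updated` value would.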
## Integration with ImageDownloader
### Automatic Integration
When asset tracking is enabled (default), the `ImageDownloader` automatically:
1. **Initializes Tracker**: Creates an `AssetTracker` instance
2. **Filters Assets**: Only downloads new/modified assets
3. **Records Downloads**: Marks successful/failed downloads in metadata
4. **Provides Feedback**: Shows statistics about skipped vs downloaded assets
### Example Integration
```python
import asyncio

from image_downloader import ImageDownloader


async def main():
    # Asset tracking is enabled by default
    downloader = ImageDownloader(
        api_url="https://api.parentzone.me",
        list_endpoint="/v1/media/list",
        download_endpoint="/v1/media",
        output_dir="./images",
        email="user@example.com",
        password="password",
        track_assets=True,  # Default: True
    )
    # First run: downloads all assets
    await downloader.download_all_assets()
    # Second run: skips unchanged assets, downloads only new/modified ones
    await downloader.download_all_assets()


asyncio.run(main())
```
## Testing
### Unit Tests
```bash
# Run comprehensive asset tracking tests
python3 test_asset_tracking.py
# Output shows:
# ✅ Basic tracking test passed!
# ✅ Modified asset detection test passed!
# ✅ Cleanup functionality test passed!
# ✅ Integration test completed!
```
### Live Demo
```bash
# Demonstrate asset tracking with real API
python3 demo_asset_tracking.py
# Shows:
# - Authentication process
# - Current asset status
# - First download run (downloads new assets)
# - Second run (skips all assets)
# - Final statistics
```
## Performance Benefits
### Network Efficiency
- **Reduced API Calls**: Calls the download endpoint only for assets that have changed
- **Bandwidth Savings**: Skips unchanged assets entirely
- **Faster Sync**: Subsequent runs complete much faster
### Storage Efficiency
- **No Duplicates**: Prevents downloading the same asset multiple times
- **Smart Cleanup**: Removes metadata for deleted files
- **Size Tracking**: Monitors total storage usage
### Example Performance Impact
```
First Run: 150 assets → Downloaded 150 (100%)
Second Run: 150 assets → Downloaded 0 (0%) - All up to date!
Third Run: 155 assets → Downloaded 5 (3.2%) - Only new ones
```
## Troubleshooting
### Common Issues
#### "No existing metadata file found"
This is normal for first-time usage. The system will create the metadata file automatically.
#### "File missing, removing from metadata"
The cleanup process found files that were deleted outside the application. This is normal maintenance.
#### Asset tracking not working
Ensure `AssetTracker` is properly imported and asset tracking is enabled:
```python
# Check if tracking is enabled
if downloader.asset_tracker:
    print("Asset tracking is enabled")
else:
    print("Asset tracking is disabled")
```
### Manual Maintenance
#### Reset All Tracking
```bash
# Remove metadata file to start fresh
rm downloaded_images/asset_metadata.json
```
#### Clean Up Missing Files
```bash
python3 image_downloader.py --cleanup --output-dir "./downloaded_images"
```
#### View Statistics
```bash
python3 image_downloader.py --show-stats --output-dir "./downloaded_images"
```
## Configuration
### Environment Variables
```bash
# Disable asset tracking globally
export DISABLE_ASSET_TRACKING=1
# Set custom metadata filename
export ASSET_METADATA_FILE="my_assets.json"
```
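For orientation, the sketch below shows one way such variables could be interpreted. It is an assumption-laden illustration: the function name `tracking_settings` is hypothetical, treating only the value `1` as "disabled" is a guess, and the default filename is taken from the `asset_metadata.json` name mentioned under Metadata Storage.

```python
import os
from typing import Any, Dict, Mapping


def tracking_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Derive tracking settings from environment variables (hypothetical helper)."""
    # Assumption: only the exact value "1" disables tracking.
    disabled = env.get("DISABLE_ASSET_TRACKING", "") == "1"
    metadata_file = env.get("ASSET_METADATA_FILE", "asset_metadata.json")
    return {"track_assets": not disabled, "metadata_file": metadata_file}
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary instead of the real process environment.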
### Programmatic Configuration
```python
from asset_tracker import AssetTracker
from image_downloader import ImageDownloader

# Custom metadata file location
tracker = AssetTracker(
    storage_dir="./images",
    metadata_file="custom_metadata.json",
)

# Disable tracking for a specific downloader
downloader = ImageDownloader(
    # ... other params ...
    track_assets=False,
)
```
## Future Enhancements
### Planned Features
- **Parallel Metadata Updates**: Concurrent metadata operations
- **Cloud Sync**: Sync metadata across multiple devices
- **Asset Versioning**: Track multiple versions of the same asset
- **Batch Operations**: Bulk metadata operations for large datasets
- **Web Interface**: Browser-based asset management
### Extensibility
The asset tracking system is designed to be extensible:
```python
from asset_tracker import AssetTracker


# Custom asset identification
class CustomAssetTracker(AssetTracker):
    def _get_asset_key(self, asset):
        # Custom logic for asset identification
        return f"{asset.get('category')}_{asset.get('id')}"

    def _get_asset_hash(self, asset):
        # Custom logic for change detection
        return super()._get_asset_hash(asset)
```
## API Reference
### AssetTracker Class
| Method | Description | Parameters | Returns |
|--------|-------------|------------|---------|
| `__init__` | Initialize tracker | `storage_dir`, `metadata_file` | None |
| `get_new_assets` | Find new/modified assets | `api_assets: List[Dict]` | `List[Dict]` |
| `mark_asset_downloaded` | Record download | `asset`, `filepath`, `success` | None |
| `is_asset_downloaded` | Check if downloaded | `asset: Dict` | `bool` |
| `is_asset_modified` | Check if modified | `asset: Dict` | `bool` |
| `cleanup_missing_files` | Remove stale metadata | None | None |
| `get_stats` | Get statistics | None | `Dict` |
| `print_stats` | Print formatted stats | None | None |
### ImageDownloader Integration
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `track_assets` | `bool` | `True` | Enable asset tracking |

| Method | Description | Parameters |
|--------|-------------|------------|
| `download_all_assets` | Download assets | `force_redownload: bool = False` |
### Command Line Options
| Option | Description |
|--------|-------------|
| `--no-tracking` | Disable asset tracking |
| `--force-redownload` | Download all assets regardless of tracking |
| `--show-stats` | Display asset statistics |
| `--cleanup` | Clean up missing file metadata |
## Contributing
To contribute to the asset tracking system:
1. **Test Changes**: Run `python3 test_asset_tracking.py`
2. **Update Documentation**: Modify this README as needed
3. **Follow Patterns**: Use existing code patterns and error handling
4. **Add Tests**: Include tests for new functionality
## License
This asset tracking system is part of the ParentZone Downloader project.