# Asset Tracking System

This document describes the asset tracking system implemented for the ParentZone Downloader, which identifies and downloads only new or modified assets, avoiding unnecessary re-downloads.

## Overview

The asset tracking system consists of two main components:

1. **AssetTracker** (`asset_tracker.py`) - Manages local metadata and identifies new/modified assets
2. **ImageDownloader Integration** - Enhanced downloader with asset tracking capabilities

## Features

### 🎯 Smart Asset Detection
- **New Assets**: Automatically detects assets that haven't been downloaded before
- **Modified Assets**: Identifies assets that have changed since last download (based on timestamp, size, etc.)
- **Unchanged Assets**: Efficiently skips assets that are already up-to-date locally

### 📊 Comprehensive Tracking
- **Metadata Storage**: Stores asset metadata in JSON format for persistence
- **File Integrity**: Tracks file sizes, modification times, and content hashes
- **Download History**: Maintains records of successful and failed downloads

### 🧹 Maintenance Features
- **Cleanup**: Removes metadata for files that no longer exist on disk
- **Statistics**: Provides detailed statistics about tracked assets
- **Validation**: Ensures consistency between metadata and actual files

## Quick Start

### Basic Usage with Asset Tracking

```bash
# Download only new/modified assets (default behavior)
python3 image_downloader.py \
  --api-url "https://api.parentzone.me" \
  --list-endpoint "/v1/media/list" \
  --download-endpoint "/v1/media" \
  --output-dir "./downloaded_images" \
  --email "your-email@example.com" \
  --password "your-password"
```

### Advanced Options

```bash
# Disable asset tracking (download all assets)
python3 image_downloader.py [options] --no-tracking

# Force re-download of all assets
python3 image_downloader.py [options] --force-redownload

# Show asset tracking statistics
python3 image_downloader.py [options] --show-stats

# Clean up metadata for missing files
python3 image_downloader.py [options] --cleanup
```

## Asset Tracker API

### Basic Usage

```python
from asset_tracker import AssetTracker

# Initialize tracker
tracker = AssetTracker(storage_dir="downloaded_images")

# Get new assets that need downloading
api_assets = [...]  # Assets from API response
new_assets = tracker.get_new_assets(api_assets)

# Mark an asset as downloaded
tracker.mark_asset_downloaded(asset, filepath, success=True)

# Get statistics
stats = tracker.get_stats()
```

### Key Methods

#### `get_new_assets(api_assets: List[Dict]) -> List[Dict]`
Identifies new or modified assets that need to be downloaded.

**Parameters:**
- `api_assets`: List of asset dictionaries from API response

**Returns:**
- List of assets that need to be downloaded

**Example:**
```python
# API returns 100 assets, but only 5 are new/modified
# (run inside an async function; fetch_assets_from_api is a placeholder)
api_assets = await fetch_assets_from_api()
new_assets = tracker.get_new_assets(api_assets)
print(f"Need to download {len(new_assets)} out of {len(api_assets)} assets")
```

#### `mark_asset_downloaded(asset: Dict, filepath: Path, success: bool)`
Records that an asset has been downloaded (or attempted).

**Parameters:**
- `asset`: Asset dictionary from API
- `filepath`: Local path where asset was saved
- `success`: Whether download was successful

#### `cleanup_missing_files()`
Removes metadata entries for files that no longer exist on disk.

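
The cleanup step can be pictured as a filter over the metadata dictionary. The following is a minimal sketch, not the code in `asset_tracker.py`: the function name `remove_missing_entries` is hypothetical, and it assumes each entry stores its location under a `filepath` key, as in the metadata example in this document.

```python
from pathlib import Path
from typing import Any, Dict


def remove_missing_entries(metadata: Dict[str, Dict[str, Any]]) -> int:
    """Drop entries whose recorded file is gone; return how many were removed.

    Hypothetical helper: assumes every entry keeps its path under "filepath".
    """
    stale = []
    for key, entry in metadata.items():
        filepath = entry.get("filepath")
        # An entry without a path, or whose path no longer exists, is stale.
        if not filepath or not Path(filepath).exists():
            stale.append(key)
    # Delete after the scan so the dict is not mutated while iterating.
    for key in stale:
        del metadata[key]
    return len(stale)
```

Collecting the stale keys first keeps the loop safe, since deleting from a dictionary while iterating over it raises a `RuntimeError`.
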
#### `get_stats() -> Dict`
Returns comprehensive statistics about tracked assets.

**Returns:**
```python
{
    'total_tracked_assets': 150,
    'successful_downloads': 145,
    'failed_downloads': 5,
    'existing_files': 140,
    'missing_files': 10,
    'total_size_bytes': 524288000,
    'total_size_mb': 500.0
}
```

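
As a rough illustration of how such a dictionary could be derived from the stored metadata, here is a sketch under stated assumptions. The function name `summarize` is hypothetical; it assumes that existing and missing files are counted over all tracked entries (which matches 140 + 10 = 150 in the example), that sizes come from each entry's `file_size`, and that one MB is 1024 × 1024 bytes (which matches 524288000 bytes being reported as 500.0).

```python
from pathlib import Path
from typing import Any, Dict

BYTES_PER_MB = 1024 * 1024  # assumption: 524288000 bytes is shown as 500.0 MB


def summarize(metadata: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical reconstruction of the statistics dictionary."""
    entries = list(metadata.values())
    successful = sum(1 for e in entries if e.get("success"))
    # Count entries whose recorded file is still present on disk.
    existing = sum(
        1 for e in entries
        if e.get("filepath") and Path(e["filepath"]).exists()
    )
    # Assumption: total size is the sum of the recorded file sizes.
    total_bytes = sum(int(e.get("file_size") or 0) for e in entries)
    return {
        "total_tracked_assets": len(entries),
        "successful_downloads": successful,
        "failed_downloads": len(entries) - successful,
        "existing_files": existing,
        "missing_files": len(entries) - existing,
        "total_size_bytes": total_bytes,
        "total_size_mb": round(total_bytes / BYTES_PER_MB, 2),
    }
```
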
## Metadata Storage

### File Structure
Asset metadata is stored in `{output_dir}/asset_metadata.json`:

```json
{
  "asset_001": {
    "asset_id": "asset_001",
    "filename": "family_photo.jpg",
    "filepath": "/path/to/downloaded_images/family_photo.jpg",
    "download_date": "2024-01-15T10:30:00",
    "success": true,
    "content_hash": "d41d8cd98f00b204e9800998ecf8427e",
    "file_size": 1024000,
    "file_modified": "2024-01-15T10:30:00",
    "api_data": {
      "id": "asset_001",
      "name": "family_photo.jpg",
      "updated": "2024-01-01T10:00:00Z",
      "size": 1024000,
      "mimeType": "image/jpeg"
    }
  }
}
```

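
Reading and writing a file of this shape needs little more than the standard `json` module. The sketch below is illustrative only: `load_metadata` and `save_metadata` are hypothetical names, and writing through a temporary file is a design choice made here, not something the document confirms about `AssetTracker`.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_metadata(path: Path) -> Dict[str, Any]:
    """Return the stored metadata, or an empty dict on first use."""
    if not path.exists():
        return {}
    with path.open("r", encoding="utf-8") as fh:
        return json.load(fh)


def save_metadata(path: Path, metadata: Dict[str, Any]) -> None:
    """Write metadata via a temporary file, then swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    with tmp.open("w", encoding="utf-8") as fh:
        json.dump(metadata, fh, indent=2)
    # Replacing the old file in one step avoids a half-written JSON file
    # if the process is interrupted during the write.
    tmp.replace(path)
```

Returning an empty dictionary when the file is absent corresponds to the first-run case, where no metadata has been written yet.
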
### Asset Identification
Assets are identified using the following priority:
1. `id` field
2. `assetId` field
3. `uuid` field
4. MD5 hash of asset data (fallback)

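
The priority order above can be sketched as a short lookup. This is an illustration rather than the tracker's actual code: the function name `asset_key` is hypothetical, and serializing the asset as JSON with sorted keys before hashing is an assumption about what "MD5 hash of asset data" means.

```python
import hashlib
import json
from typing import Any, Dict


def asset_key(asset: Dict[str, Any]) -> str:
    """Return a stable key for an asset, following the documented priority."""
    for field in ("id", "assetId", "uuid"):
        value = asset.get(field)
        if value is not None and value != "":
            return str(value)
    # Fallback (assumption): hash a canonical JSON form so that key order
    # in the API response does not change the result.
    canonical = json.dumps(asset, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()
```

One consequence of the fallback is worth noting: an asset without any identifier field gets a new key whenever any of its data changes, so it would be treated as a new asset rather than a modified one.
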
### Change Detection
Assets are considered modified if their content hash changes. The hash is based on:
- `updated` timestamp
- `modified` timestamp
- `lastModified` timestamp
- `size` field
- `checksum` field
- `etag` field

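
A hash over just these fields might look like the following sketch. The names `change_hash` and `is_modified` are hypothetical, and both the use of MD5 and the JSON serialization are assumptions; the 32-character `content_hash` in the metadata example is consistent with MD5 but does not confirm it.

```python
import hashlib
import json
from typing import Any, Dict

# Fields listed above as inputs to change detection.
CHANGE_FIELDS = ("updated", "modified", "lastModified", "size", "checksum", "etag")


def change_hash(asset: Dict[str, Any]) -> str:
    """Hash only the change-relevant fields that are present on the asset."""
    relevant = {field: asset[field] for field in CHANGE_FIELDS if field in asset}
    canonical = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def is_modified(asset: Dict[str, Any], stored_hash: str) -> bool:
    """An asset counts as modified when its hash differs from the stored one."""
    return change_hash(asset) != stored_hash
```

Because fields outside the list are ignored, a change to something like the display name alone would not trigger a re-download under this sketch.
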
## Integration with ImageDownloader

### Automatic Integration
When asset tracking is enabled (default), the `ImageDownloader` automatically:

1. **Initializes Tracker**: Creates an `AssetTracker` instance
2. **Filters Assets**: Only downloads new/modified assets
3. **Records Downloads**: Marks successful/failed downloads in metadata
4. **Provides Feedback**: Shows statistics about skipped vs downloaded assets

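
The four steps can be sketched end to end with a stand-in tracker. Everything here is hypothetical: `InMemoryTracker` only mimics the two documented method names, `sync` is not a real `ImageDownloader` method, and the assumptions that assets carry `id`, `name`, and `updated` fields and that failed downloads are retried on the next run are made for illustration.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List

Asset = Dict[str, Any]


class InMemoryTracker:
    """Hypothetical stand-in exposing the two documented tracker methods."""

    def __init__(self) -> None:
        self.records: Dict[str, Dict[str, Any]] = {}

    def get_new_assets(self, api_assets: List[Asset]) -> List[Asset]:
        needed = []
        for asset in api_assets:
            record = self.records.get(asset["id"])
            # New, previously failed, or changed since the last download.
            if (record is None or not record["success"]
                    or record["updated"] != asset.get("updated")):
                needed.append(asset)
        return needed

    def mark_asset_downloaded(self, asset: Asset, filepath: Path, success: bool) -> None:
        self.records[asset["id"]] = {
            "filepath": str(filepath),
            "success": success,
            "updated": asset.get("updated"),
        }


def sync(api_assets: List[Asset], tracker: InMemoryTracker,
         download: Callable[[Asset, Path], None], output_dir: Path) -> Dict[str, int]:
    """Filter, download, record, and report, mirroring the steps above."""
    to_download = tracker.get_new_assets(api_assets)
    downloaded = 0
    for asset in to_download:
        filepath = output_dir / asset["name"]
        try:
            download(asset, filepath)
            success = True
        except OSError:
            success = False
        tracker.mark_asset_downloaded(asset, filepath, success=success)
        if success:
            downloaded += 1
    return {
        "downloaded": downloaded,
        "skipped": len(api_assets) - len(to_download),
        "failed": len(to_download) - downloaded,
    }
```

Recording failures as well as successes is what lets a later run distinguish "never tried" from "tried and failed".
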
### Example Integration

```python
import asyncio

from image_downloader import ImageDownloader


async def main():
    # Asset tracking enabled by default
    downloader = ImageDownloader(
        api_url="https://api.parentzone.me",
        list_endpoint="/v1/media/list",
        download_endpoint="/v1/media",
        output_dir="./images",
        email="user@example.com",
        password="password",
        track_assets=True,  # Default: True
    )

    # First run: Downloads all assets
    await downloader.download_all_assets()

    # Second run: Skips unchanged assets, downloads only new/modified ones
    await downloader.download_all_assets()


# `await` is only valid inside a coroutine, so run the example via asyncio
asyncio.run(main())
```

## Testing

### Unit Tests
```bash
# Run comprehensive asset tracking tests
python3 test_asset_tracking.py

# Output shows:
# ✅ Basic tracking test passed!
# ✅ Modified asset detection test passed!
# ✅ Cleanup functionality test passed!
# ✅ Integration test completed!
```

### Live Demo
```bash
# Demonstrate asset tracking with real API
python3 demo_asset_tracking.py

# Shows:
# - Authentication process
# - Current asset status
# - First download run (downloads new assets)
# - Second run (skips all assets)
# - Final statistics
```

## Performance Benefits

### Network Efficiency
- **Fewer Download Requests**: Requests only assets that are new or have changed
- **Bandwidth Savings**: Skips unchanged assets entirely
- **Faster Sync**: Subsequent runs complete much faster

### Storage Efficiency
- **No Duplicates**: Prevents downloading the same asset multiple times
- **Smart Cleanup**: Removes metadata for deleted files
- **Size Tracking**: Monitors total storage usage

### Example Performance Impact
```
First Run: 150 assets → Downloaded 150 (100%)
Second Run: 150 assets → Downloaded 0 (0%) - All up to date!
Third Run: 155 assets → Downloaded 5 (3.2%) - Only new ones
```

## Troubleshooting

### Common Issues

#### "No existing metadata file found"
This is normal for first-time usage. The system will create the metadata file automatically.

#### "File missing, removing from metadata"
The cleanup process found files that were deleted outside the application. This is normal maintenance.

#### Asset tracking not working
Ensure `AssetTracker` is properly imported and asset tracking is enabled:
```python
# Check if tracking is enabled
if downloader.asset_tracker:
    print("Asset tracking is enabled")
else:
    print("Asset tracking is disabled")
```

### Manual Maintenance

#### Reset All Tracking
```bash
# Remove metadata file to start fresh
rm downloaded_images/asset_metadata.json
```

#### Clean Up Missing Files
```bash
python3 image_downloader.py --cleanup --output-dir "./downloaded_images"
```

#### View Statistics
```bash
python3 image_downloader.py --show-stats --output-dir "./downloaded_images"
```

## Configuration

### Environment Variables
```bash
# Disable asset tracking globally
export DISABLE_ASSET_TRACKING=1

# Set custom metadata filename
export ASSET_METADATA_FILE="my_assets.json"
```

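
How these variables are interpreted is not spelled out here, so the following is only a sketch of one plausible reading. The helper names `tracking_enabled` and `metadata_filename` are hypothetical, and treating `1`, `true`, and `yes` as "disable" is an assumption; only the value `1` appears in the example above.

```python
import os
from typing import Mapping, Optional

DEFAULT_METADATA_FILE = "asset_metadata.json"  # default name used in this document


def tracking_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Tracking stays on unless DISABLE_ASSET_TRACKING holds a truthy value."""
    env = os.environ if env is None else env
    value = env.get("DISABLE_ASSET_TRACKING", "").strip().lower()
    # Assumption: these spellings count as "disable"; anything else does not.
    return value not in {"1", "true", "yes"}


def metadata_filename(env: Optional[Mapping[str, str]] = None) -> str:
    """Use ASSET_METADATA_FILE when set and non-empty, else the default."""
    env = os.environ if env is None else env
    return env.get("ASSET_METADATA_FILE") or DEFAULT_METADATA_FILE
```

Passing the environment as a parameter keeps the helpers easy to test without touching the real process environment.
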
### Programmatic Configuration
```python
# Custom metadata file location
tracker = AssetTracker(
    storage_dir="./images",
    metadata_file="custom_metadata.json"
)

# Disable tracking for specific downloader
downloader = ImageDownloader(
    # ... other params ...
    track_assets=False
)
```

## Future Enhancements

### Planned Features
- **Parallel Metadata Updates**: Concurrent metadata operations
- **Cloud Sync**: Sync metadata across multiple devices
- **Asset Versioning**: Track multiple versions of the same asset
- **Batch Operations**: Bulk metadata operations for large datasets
- **Web Interface**: Browser-based asset management

### Extensibility
The asset tracking system is designed to be extensible:

```python
from asset_tracker import AssetTracker


# Custom asset identification
class CustomAssetTracker(AssetTracker):
    def _get_asset_key(self, asset):
        # Custom logic for asset identification
        return f"{asset.get('category')}_{asset.get('id')}"

    def _get_asset_hash(self, asset):
        # Custom logic for change detection
        return super()._get_asset_hash(asset)
```

## API Reference

### AssetTracker Class

| Method | Description | Parameters | Returns |
|--------|-------------|------------|---------|
| `__init__` | Initialize tracker | `storage_dir`, `metadata_file` | None |
| `get_new_assets` | Find new/modified assets | `api_assets: List[Dict]` | `List[Dict]` |
| `mark_asset_downloaded` | Record download | `asset`, `filepath`, `success` | None |
| `is_asset_downloaded` | Check if downloaded | `asset: Dict` | `bool` |
| `is_asset_modified` | Check if modified | `asset: Dict` | `bool` |
| `cleanup_missing_files` | Remove stale metadata | None | None |
| `get_stats` | Get statistics | None | `Dict` |
| `print_stats` | Print formatted stats | None | None |

### ImageDownloader Integration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `track_assets` | `bool` | `True` | Enable asset tracking |

| Method | Description | Parameters |
|--------|-------------|------------|
| `download_all_assets` | Download assets | `force_redownload: bool = False` |

### Command Line Options

| Option | Description |
|--------|-------------|
| `--no-tracking` | Disable asset tracking |
| `--force-redownload` | Download all assets regardless of tracking |
| `--show-stats` | Display asset statistics |
| `--cleanup` | Clean up missing file metadata |

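
For readers wiring similar flags into their own script, the options in the table map naturally onto `argparse` boolean flags. This is a sketch and not the parser used by `image_downloader.py`: the function name `build_parser`, the help strings, and the default for `--output-dir` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the tracking-related options."""
    parser = argparse.ArgumentParser(description="Asset tracking options (sketch)")
    parser.add_argument("--output-dir", default="./downloaded_images",
                        help="Directory holding downloads and metadata")
    parser.add_argument("--no-tracking", action="store_true",
                        help="Disable asset tracking")
    parser.add_argument("--force-redownload", action="store_true",
                        help="Download all assets regardless of tracking")
    parser.add_argument("--show-stats", action="store_true",
                        help="Display asset statistics")
    parser.add_argument("--cleanup", action="store_true",
                        help="Clean up missing file metadata")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse converts dashes to underscores: --no-tracking -> args.no_tracking
    print(f"tracking enabled: {not args.no_tracking}")
```
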
## Contributing

To contribute to the asset tracking system:

1. **Test Changes**: Run `python3 test_asset_tracking.py`
2. **Update Documentation**: Modify this README as needed
3. **Follow Patterns**: Use existing code patterns and error handling
4. **Add Tests**: Include tests for new functionality

## License

This asset tracking system is part of the ParentZone Downloader project.