first commit
ASSET_TRACKING_README.md
# Asset Tracking System

This document describes the asset tracking system implemented for the ParentZone Downloader. The system identifies new or modified assets and downloads only those, avoiding unnecessary re-downloads.

## Overview

The asset tracking system consists of two main components:

1. **AssetTracker** (`asset_tracker.py`) - Manages local metadata and identifies new/modified assets
2. **ImageDownloader Integration** - Enhanced downloader with asset tracking capabilities

## Features

### 🎯 Smart Asset Detection
- **New Assets**: Automatically detects assets that haven't been downloaded before
- **Modified Assets**: Identifies assets that have changed since the last download (based on the timestamp, size, checksum, and ETag fields listed under Change Detection)
- **Unchanged Assets**: Skips assets that are already up to date locally

### 📊 Comprehensive Tracking
- **Metadata Storage**: Stores asset metadata in JSON format for persistence
- **File Integrity**: Tracks file sizes, modification times, and content hashes
- **Download History**: Maintains records of successful and failed downloads

### 🧹 Maintenance Features
- **Cleanup**: Removes metadata for files that no longer exist on disk
- **Statistics**: Provides detailed statistics about tracked assets
- **Validation**: Ensures consistency between metadata and actual files

## Quick Start

### Basic Usage with Asset Tracking

```bash
# Download only new/modified assets (default behavior)
python3 image_downloader.py \
  --api-url "https://api.parentzone.me" \
  --list-endpoint "/v1/media/list" \
  --download-endpoint "/v1/media" \
  --output-dir "./downloaded_images" \
  --email "your-email@example.com" \
  --password "your-password"
```

### Advanced Options

```bash
# Disable asset tracking (download all assets)
python3 image_downloader.py [options] --no-tracking

# Force re-download of all assets
python3 image_downloader.py [options] --force-redownload

# Show asset tracking statistics
python3 image_downloader.py [options] --show-stats

# Clean up metadata for missing files
python3 image_downloader.py [options] --cleanup
```

## Asset Tracker API

### Basic Usage

```python
from asset_tracker import AssetTracker

# Initialize tracker
tracker = AssetTracker(storage_dir="downloaded_images")

# Get new assets that need downloading
api_assets = [...]  # Assets from API response
new_assets = tracker.get_new_assets(api_assets)

# Mark an asset as downloaded
# (asset is one entry from new_assets; filepath is the Path it was saved to)
tracker.mark_asset_downloaded(asset, filepath, success=True)

# Get statistics
stats = tracker.get_stats()
```

### Key Methods

#### `get_new_assets(api_assets: List[Dict]) -> List[Dict]`
Identifies new or modified assets that need to be downloaded.

**Parameters:**
- `api_assets`: List of asset dictionaries from API response

**Returns:**
- List of assets that need to be downloaded

**Example:**
```python
async def sync(tracker):
    # API returns 100 assets, but only 5 are new/modified.
    # fetch_assets_from_api() stands in for your own API call.
    api_assets = await fetch_assets_from_api()
    new_assets = tracker.get_new_assets(api_assets)
    print(f"Need to download {len(new_assets)} out of {len(api_assets)} assets")
```

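To make the filtering step concrete, here is a minimal sketch of how such a selection could work. It is not the code in `asset_tracker.py`: the function name `select_new_assets`, the use of the `id` and `updated` fields alone, and the decision to retry previously failed downloads are all assumptions made for illustration.

```python
from typing import Dict, List


def select_new_assets(api_assets: List[Dict], tracked: Dict[str, Dict]) -> List[Dict]:
    """Return assets that are untracked, previously failed, or changed (hypothetical helper)."""
    pending = []
    for asset in api_assets:
        record = tracked.get(str(asset.get("id")))
        if record is None or not record.get("success"):
            # Never downloaded, or the last attempt failed (retrying is an assumption).
            pending.append(asset)
        elif record.get("api_data", {}).get("updated") != asset.get("updated"):
            # The API timestamp differs from the copy stored in the metadata.
            pending.append(asset)
    return pending


tracked = {
    "a1": {"success": True, "api_data": {"id": "a1", "updated": "2024-01-01T10:00:00Z"}},
}
api_assets = [
    {"id": "a1", "updated": "2024-01-01T10:00:00Z"},  # unchanged
    {"id": "a2", "updated": "2024-01-02T09:00:00Z"},  # not tracked yet
]
pending_ids = [a["id"] for a in select_new_assets(api_assets, tracked)]
print(pending_ids)
```

Only `a2` is selected here, because `a1` is tracked, succeeded, and carries the same `updated` value as the stored record.
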
#### `mark_asset_downloaded(asset: Dict, filepath: Path, success: bool)`
Records that an asset has been downloaded (or attempted).

**Parameters:**
- `asset`: Asset dictionary from API
- `filepath`: Local path where asset was saved
- `success`: Whether download was successful

#### `cleanup_missing_files()`
Removes metadata entries for files that no longer exist on disk.

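The cleanup idea can be sketched in a few lines. This is an illustration only, assuming each metadata entry stores a `filepath` string as in the metadata example in this document; `drop_missing` is a hypothetical helper, not a method of `AssetTracker`.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def drop_missing(metadata: Dict[str, Dict]) -> List[str]:
    """Delete entries whose file no longer exists and return the removed keys."""
    removed = [key for key, entry in metadata.items()
               if not Path(entry["filepath"]).exists()]
    for key in removed:
        del metadata[key]
    return removed


with tempfile.TemporaryDirectory() as tmp:
    kept = Path(tmp) / "kept.jpg"
    kept.write_bytes(b"data")  # this file exists on disk
    metadata = {
        "asset_001": {"filepath": str(kept)},
        "asset_002": {"filepath": str(Path(tmp) / "deleted.jpg")},  # never created
    }
    removed = drop_missing(metadata)

print(removed, list(metadata))
```

The keys are collected first and deleted afterwards so the dictionary is never modified while it is being iterated.
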
#### `get_stats() -> Dict`
Returns comprehensive statistics about tracked assets.

**Returns:**
```python
{
    'total_tracked_assets': 150,
    'successful_downloads': 145,
    'failed_downloads': 5,
    'existing_files': 140,
    'missing_files': 10,
    'total_size_bytes': 524288000,
    'total_size_mb': 500.0
}
```

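As a rough sketch of how these numbers relate to the stored metadata, the snippet below derives a subset of them from an in-memory dictionary. The helper name `summarize`, the choice to count sizes only for successful downloads, and the 1 MB = 1024 × 1024 bytes conversion are assumptions; the conversion is at least consistent with the sample above, where 524288000 bytes corresponds to 500.0 MB. The `existing_files` and `missing_files` counts are left out because they require checking the disk.

```python
from typing import Dict


def summarize(metadata: Dict[str, Dict]) -> Dict:
    """Compute download counts and total size from metadata entries (hypothetical helper)."""
    entries = list(metadata.values())
    ok = sum(1 for e in entries if e.get("success"))
    total_bytes = sum(e.get("file_size", 0) for e in entries if e.get("success"))
    return {
        "total_tracked_assets": len(entries),
        "successful_downloads": ok,
        "failed_downloads": len(entries) - ok,
        "total_size_bytes": total_bytes,
        "total_size_mb": round(total_bytes / (1024 * 1024), 2),
    }


stats = summarize({
    "asset_001": {"success": True, "file_size": 1048576},
    "asset_002": {"success": False, "file_size": 0},
})
print(stats)
```
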
## Metadata Storage

### File Structure
Asset metadata is stored in `{output_dir}/asset_metadata.json`:

```json
{
  "asset_001": {
    "asset_id": "asset_001",
    "filename": "family_photo.jpg",
    "filepath": "/path/to/downloaded_images/family_photo.jpg",
    "download_date": "2024-01-15T10:30:00",
    "success": true,
    "content_hash": "d41d8cd98f00b204e9800998ecf8427e",
    "file_size": 1024000,
    "file_modified": "2024-01-15T10:30:00",
    "api_data": {
      "id": "asset_001",
      "name": "family_photo.jpg",
      "updated": "2024-01-01T10:00:00Z",
      "size": 1024000,
      "mimeType": "image/jpeg"
    }
  }
}
```

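Because the metadata is plain JSON, it can be inspected with the standard library alone. The sketch below lists entries whose last download attempt failed; `failed_assets` is a hypothetical helper, and it assumes the file layout shown above (a top-level object keyed by asset ID, each entry carrying a `success` flag).

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def failed_assets(metadata_path: Path) -> List[str]:
    """Return the IDs of entries whose last download attempt failed."""
    if not metadata_path.exists():
        return []  # first run: nothing has been recorded yet
    metadata: Dict[str, Dict] = json.loads(metadata_path.read_text(encoding="utf-8"))
    return [key for key, entry in metadata.items() if not entry.get("success")]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "asset_metadata.json"
    path.write_text(json.dumps({
        "asset_001": {"success": True},
        "asset_002": {"success": False},
    }), encoding="utf-8")
    failed = failed_assets(path)
    missing = failed_assets(Path(tmp) / "absent.json")

print(failed, missing)
```
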
### Asset Identification
Assets are identified using the following priority:
1. `id` field
2. `assetId` field
3. `uuid` field
4. MD5 hash of asset data (fallback)

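The priority order above can be sketched as follows. This is an illustration rather than the tracker's actual code: the function name `asset_key` is hypothetical, and serializing the asset with `json.dumps(..., sort_keys=True)` before hashing is an assumption about how "asset data" is turned into bytes.

```python
import hashlib
import json
from typing import Dict


def asset_key(asset: Dict) -> str:
    """Pick the first available identifier, falling back to an MD5 of the asset data."""
    for field in ("id", "assetId", "uuid"):
        value = asset.get(field)
        if value not in (None, ""):
            return str(value)
    # Fallback: hash a canonical (key-sorted) JSON form so key order does not matter.
    canonical = json.dumps(asset, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


by_id = asset_key({"id": "asset_001", "uuid": "ignored-when-id-exists"})
by_asset_id = asset_key({"assetId": 42})
fallback = asset_key({"name": "family_photo.jpg", "size": 1024000})
print(by_id, by_asset_id, len(fallback))
```

Sorting the keys before hashing means two dictionaries with the same content but a different key order map to the same fallback key.
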
### Change Detection
Assets are considered modified if their content hash changes. The hash is based on:
- `updated` timestamp
- `modified` timestamp
- `lastModified` timestamp
- `size` field
- `checksum` field
- `etag` field

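A minimal sketch of such a hash is shown below, assuming it is an MD5 digest over just the fields listed above; the actual algorithm and serialization used by `AssetTracker` are not specified in this document, and `change_hash` and `CHANGE_FIELDS` are names chosen for the example.

```python
import hashlib
import json
from typing import Dict

# Fields that influence change detection, as listed above.
CHANGE_FIELDS = ("updated", "modified", "lastModified", "size", "checksum", "etag")


def change_hash(asset: Dict) -> str:
    """Hash only the change-relevant fields that are present on the asset."""
    relevant = {field: asset[field] for field in CHANGE_FIELDS if field in asset}
    payload = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


before = {"id": "asset_001", "name": "a.jpg", "updated": "2024-01-01T10:00:00Z", "size": 1024000}
renamed = dict(before, name="b.jpg")   # name is not a change field
resized = dict(before, size=2048000)   # size is a change field

same_after_rename = change_hash(before) == change_hash(renamed)
same_after_resize = change_hash(before) == change_hash(resized)
print(same_after_rename, same_after_resize)
```

Under these assumptions a rename alone does not trigger a re-download, while a new size or timestamp does.
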
## Integration with ImageDownloader

### Automatic Integration
When asset tracking is enabled (default), the `ImageDownloader` automatically:

1. **Initializes Tracker**: Creates an `AssetTracker` instance
2. **Filters Assets**: Only downloads new/modified assets
3. **Records Downloads**: Marks successful/failed downloads in metadata
4. **Provides Feedback**: Shows statistics about skipped vs downloaded assets

### Example Integration

```python
import asyncio

from image_downloader import ImageDownloader


async def main():
    # Asset tracking enabled by default
    downloader = ImageDownloader(
        api_url="https://api.parentzone.me",
        list_endpoint="/v1/media/list",
        download_endpoint="/v1/media",
        output_dir="./images",
        email="user@example.com",
        password="password",
        track_assets=True  # Default: True
    )

    # First run: Downloads all assets
    await downloader.download_all_assets()

    # Second run: Skips unchanged assets, downloads only new/modified ones
    await downloader.download_all_assets()


asyncio.run(main())
```

## Testing

### Unit Tests
```bash
# Run comprehensive asset tracking tests
python3 test_asset_tracking.py

# Output shows:
# ✅ Basic tracking test passed!
# ✅ Modified asset detection test passed!
# ✅ Cleanup functionality test passed!
# ✅ Integration test completed!
```

### Live Demo
```bash
# Demonstrate asset tracking with real API
python3 demo_asset_tracking.py

# Shows:
# - Authentication process
# - Current asset status
# - First download run (downloads new assets)
# - Second run (skips all assets)
# - Final statistics
```

## Performance Benefits

### Network Efficiency
- **Fewer Download Requests**: Requests only assets that are new or have changed
- **Bandwidth Savings**: Skips unchanged assets entirely
- **Faster Sync**: Subsequent runs transfer only the changed assets, so they finish sooner

### Storage Efficiency
- **No Duplicates**: Prevents downloading the same asset multiple times
- **Smart Cleanup**: Removes metadata for deleted files
- **Size Tracking**: Monitors total storage usage

### Example Performance Impact
```
First Run:  150 assets → Downloaded 150 (100%)
Second Run: 150 assets → Downloaded 0 (0%) - All up to date!
Third Run:  155 assets → Downloaded 5 (3.2%) - Only new ones
```

## Troubleshooting

### Common Issues

#### "No existing metadata file found"
This is normal for first-time usage. The system will create the metadata file automatically.

#### "File missing, removing from metadata"
The cleanup process found files that were deleted outside the application. This is normal maintenance.

#### Asset tracking not working
Ensure `AssetTracker` is properly imported and asset tracking is enabled:
```python
# Check if tracking is enabled
if downloader.asset_tracker:
    print("Asset tracking is enabled")
else:
    print("Asset tracking is disabled")
```

### Manual Maintenance

#### Reset All Tracking
```bash
# Remove metadata file to start fresh
rm downloaded_images/asset_metadata.json
```

#### Clean Up Missing Files
```bash
python3 image_downloader.py --cleanup --output-dir "./downloaded_images"
```

#### View Statistics
```bash
python3 image_downloader.py --show-stats --output-dir "./downloaded_images"
```

## Configuration

### Environment Variables
```bash
# Disable asset tracking globally
export DISABLE_ASSET_TRACKING=1

# Set custom metadata filename
export ASSET_METADATA_FILE="my_assets.json"
```

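How the downloader reads these variables is not shown in this document. As a hedged sketch, they could be interpreted along these lines; the helper `tracking_settings`, the rule that only the value `1` disables tracking, and the `asset_metadata.json` default (taken from the Metadata Storage section) are assumptions.

```python
import os
from typing import Mapping, Tuple


def tracking_settings(env: Mapping[str, str] = os.environ) -> Tuple[bool, str]:
    """Return (tracking_enabled, metadata_filename) derived from environment variables."""
    # Assumption: tracking is disabled only when the variable is exactly "1".
    disabled = env.get("DISABLE_ASSET_TRACKING", "") == "1"
    metadata_file = env.get("ASSET_METADATA_FILE", "asset_metadata.json")
    return (not disabled, metadata_file)


defaults = tracking_settings({})
custom = tracking_settings({
    "DISABLE_ASSET_TRACKING": "1",
    "ASSET_METADATA_FILE": "my_assets.json",
})
print(defaults, custom)
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.
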
### Programmatic Configuration
```python
# Custom metadata file location
tracker = AssetTracker(
    storage_dir="./images",
    metadata_file="custom_metadata.json"
)

# Disable tracking for specific downloader
downloader = ImageDownloader(
    # ... other params ...
    track_assets=False
)
```

## Future Enhancements

### Planned Features
- **Parallel Metadata Updates**: Concurrent metadata operations
- **Cloud Sync**: Sync metadata across multiple devices
- **Asset Versioning**: Track multiple versions of the same asset
- **Batch Operations**: Bulk metadata operations for large datasets
- **Web Interface**: Browser-based asset management

### Extensibility
The asset tracking system is designed to be extensible:

```python
# Custom asset identification
class CustomAssetTracker(AssetTracker):
    def _get_asset_key(self, asset):
        # Custom logic for asset identification
        return f"{asset.get('category')}_{asset.get('id')}"

    def _get_asset_hash(self, asset):
        # Custom logic for change detection
        return super()._get_asset_hash(asset)
```

## API Reference

### AssetTracker Class

| Method | Description | Parameters | Returns |
|--------|-------------|------------|---------|
| `__init__` | Initialize tracker | `storage_dir`, `metadata_file` | None |
| `get_new_assets` | Find new/modified assets | `api_assets: List[Dict]` | `List[Dict]` |
| `mark_asset_downloaded` | Record download | `asset`, `filepath`, `success` | None |
| `is_asset_downloaded` | Check if downloaded | `asset: Dict` | `bool` |
| `is_asset_modified` | Check if modified | `asset: Dict` | `bool` |
| `cleanup_missing_files` | Remove stale metadata | None | None |
| `get_stats` | Get statistics | None | `Dict` |
| `print_stats` | Print formatted stats | None | None |

### ImageDownloader Integration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `track_assets` | `bool` | `True` | Enable asset tracking |

| Method | Description | Parameters |
|--------|-------------|------------|
| `download_all_assets` | Download assets | `force_redownload: bool = False` |

### Command Line Options

| Option | Description |
|--------|-------------|
| `--no-tracking` | Disable asset tracking |
| `--force-redownload` | Download all assets regardless of tracking |
| `--show-stats` | Display asset statistics |
| `--cleanup` | Clean up missing file metadata |

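For readers wiring similar flags into their own script, the options in the table can be declared with the standard `argparse` module roughly as follows. This is a sketch, not the parser in `image_downloader.py`: `build_parser` is a hypothetical name, only the tracking-related flags plus `--output-dir` are included, and the `--output-dir` default is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the tracking-related flags from the table above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="image_downloader.py")
    # Assumed default; the real script may require this option instead.
    parser.add_argument("--output-dir", default="./downloaded_images")
    parser.add_argument("--no-tracking", action="store_true",
                        help="Disable asset tracking")
    parser.add_argument("--force-redownload", action="store_true",
                        help="Download all assets regardless of tracking")
    parser.add_argument("--show-stats", action="store_true",
                        help="Display asset statistics")
    parser.add_argument("--cleanup", action="store_true",
                        help="Clean up missing file metadata")
    return parser


args = build_parser().parse_args(["--cleanup", "--output-dir", "./images"])
print(args.cleanup, args.no_tracking, args.output_dir)
```

`argparse` converts dashes in option names to underscores, so `--force-redownload` becomes `args.force_redownload`.
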
## Contributing

To contribute to the asset tracking system:

1. **Test Changes**: Run `python3 test_asset_tracking.py`
2. **Update Documentation**: Modify this README as needed
3. **Follow Patterns**: Use existing code patterns and error handling
4. **Add Tests**: Include tests for new functionality

## License

This asset tracking system is part of the ParentZone Downloader project.