claude-home/media-tools/CONTEXT.md
Cal Corum 4b7eca8a46
2026-03-12 09:00:44 -05:00


---
title: "Media Tools Overview"
description: "Directory overview for media downloading tools using Playwright browser automation and yt-dlp, covering architecture patterns, anti-bot handling, and state management."
type: context
domain: media-tools
tags: [yt-dlp, playwright, web-scraping, video-download, browser-automation]
---
# Media Tools
Tools for downloading and managing media from streaming sites.
## Overview
This directory contains utilities for:
- Extracting video URLs from streaming sites using browser automation
- Downloading videos via yt-dlp
- Managing download state for resumable operations
## Tools
### pokeflix_scraper.py
Downloads Pokémon episodes from pokeflix.tv using Playwright for browser automation.
**Location:** `scripts/pokeflix_scraper.py`
**Features:**
- Extracts episode lists from season pages
- Handles iframe-embedded video players (Streamtape, Vidoza, etc.)
- Resumable downloads with state persistence
- Configurable episode ranges
- Dry-run mode for testing
## Architecture Pattern
These tools follow a common pattern:
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────┐
│   Playwright    │────▶│  Extract embed   │────▶│   yt-dlp    │
│   (navigate)    │     │   video URLs     │     │  (download) │
└─────────────────┘     └──────────────────┘     └─────────────┘
```
**Why this approach:**
1. **Playwright** handles JavaScript-heavy sites that block simple HTTP requests
2. **Iframe extraction** finds the actual video URL when a site embeds a player from a third-party video host
3. **yt-dlp** is the de facto standard for video downloading, with broad host support
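
The three stages above can be sketched roughly as follows. This is a minimal illustration, not the code in `pokeflix_scraper.py`: the function names and the `KNOWN_HOSTS` tuple are hypothetical, and it assumes the video host can be recognised from the iframe's hostname.

```python
import subprocess
from urllib.parse import urlparse

# Hypothetical list of embed hosts; the real scraper may recognise others.
KNOWN_HOSTS = ("streamtape", "vidoza")


def pick_embed_url(frame_urls):
    """Return the first frame URL whose hostname mentions a known video host."""
    for url in frame_urls:
        host = urlparse(url).hostname or ""
        if any(name in host for name in KNOWN_HOSTS):
            return url
    return None


def find_embed_url(episode_url):
    """Open the episode page and look through its frames for a video host."""
    # Imported here so the pure helpers work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(episode_url, wait_until="networkidle")
        embed = pick_embed_url(frame.url for frame in page.frames)
        browser.close()
    return embed


def build_ytdlp_command(embed_url, output_path):
    """Build the yt-dlp invocation; -o sets the output filename template."""
    return ["yt-dlp", "-o", str(output_path), embed_url]


def download_episode(episode_url, output_path):
    """Run all three stages for a single episode (hypothetical helper)."""
    embed = find_embed_url(episode_url)
    if embed is None:
        raise RuntimeError(f"no supported embed found on {episode_url}")
    subprocess.run(build_ytdlp_command(embed, output_path), check=True)
```

Keeping the URL selection and command construction as pure functions makes them easy to check without launching a browser or touching the network.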
## Dependencies
```bash
# Python packages
pip install playwright yt-dlp
# Playwright browser installation
playwright install chromium
```
## Common Patterns
### Anti-Bot Handling
- Use headed browser mode (visible window) initially
- Random delays between requests (2-5 seconds)
- Realistic viewport and user-agent settings
- Wait for `networkidle` state after navigation
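
A minimal sketch of how these settings might be applied with Playwright's sync API. The viewport size, user-agent string, and helper names are illustrative assumptions, not values taken from the scraper.

```python
import random
import time

# Illustrative values only; the real scraper's settings are not shown here.
VIEWPORT = {"width": 1280, "height": 720}
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def polite_delay(min_s=2.0, max_s=5.0, sleep=time.sleep):
    """Sleep for a random interval (2-5 s by default) and return the delay."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


def open_page(playwright, url):
    """Launch headed Chromium and load url, waiting for the network to settle.

    `playwright` is the object yielded by `sync_playwright()`.
    """
    browser = playwright.chromium.launch(headless=False)  # visible window
    context = browser.new_context(viewport=VIEWPORT, user_agent=USER_AGENT)
    page = context.new_page()
    page.goto(url, wait_until="networkidle")
    return browser, page
```

Calling `polite_delay()` between episode page loads spaces requests out; the `sleep` parameter exists only so the delay can be stubbed in tests.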
### State Management
- A JSON state file tracks which episodes have been downloaded
- Pass the `--resume` flag to skip downloads that already completed
- The state also records error information for failed episodes, to help with debugging
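
One way such a state file could be handled is sketched below. The schema (`completed` and `errors` keys) and the function names are assumptions for illustration; the actual layout of `download_state.json` may differ.

```python
import json
from pathlib import Path


def load_state(path):
    """Load the state file, or return an empty state if it does not exist."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"completed": [], "errors": {}}


def save_state(path, state):
    """Write the state back to disk as indented JSON."""
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")


def record_result(state, episode_id, error=None):
    """Mark an episode as completed, or store its error message."""
    if error is None:
        if episode_id not in state["completed"]:
            state["completed"].append(episode_id)
        state["errors"].pop(episode_id, None)  # clear any earlier failure
    else:
        state["errors"][episode_id] = str(error)


def pending(episode_ids, state, resume):
    """Return the episodes still to download; without resume, all of them."""
    if not resume:
        return list(episode_ids)
    done = set(state["completed"])
    return [e for e in episode_ids if e not in done]
```

Saving the state after every episode, rather than once at the end, is what lets an interrupted run pick up where it left off.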
### Output Organization
```
{output_dir}/
├── {Season Name}/
│   ├── E01 - Episode Title.mp4
│   ├── E02 - Episode Title.mp4
│   └── download_state.json
```
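
The layout above could be produced by helpers along these lines. Both function names are hypothetical, and the sketch assumes season names and titles are already safe to use as file names.

```python
from pathlib import Path


def episode_path(output_dir, season_name, number, title):
    """Build '{output_dir}/{Season Name}/ENN - Title.mp4' for one episode."""
    return Path(output_dir) / season_name / f"E{number:02d} - {title}.mp4"


def state_path(output_dir, season_name):
    """The state file lives alongside the episodes of its season."""
    return Path(output_dir) / season_name / "download_state.json"
```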
## When to Use These Tools
- Downloading entire seasons of shows for offline viewing
- Archiving content before it becomes unavailable
- Building a local media library
## Legal Considerations
These tools are for personal archival use. Respect copyright laws in your jurisdiction.