| title | description | type | domain | tags |
|---|---|---|---|---|
| Media Tools Overview | Directory overview for media downloading tools using Playwright browser automation and yt-dlp, covering architecture patterns, anti-bot handling, and state management. | context | media-tools | |
Media Tools
Tools for downloading and managing media from streaming sites.
Overview
This directory contains utilities for:
- Extracting video URLs from streaming sites using browser automation
- Downloading videos via yt-dlp
- Managing download state for resumable operations
Tools
pokeflix_scraper.py
Downloads Pokemon episodes from pokeflix.tv using Playwright for browser automation.
Location: scripts/pokeflix_scraper.py
Features:
- Extracts episode lists from season pages
- Handles iframe-embedded video players (Streamtape, Vidoza, etc.)
- Resumable downloads with state persistence
- Configurable episode ranges
- Dry-run mode for testing
Architecture Pattern
These tools follow a common pattern:
┌─────────────────┐ ┌──────────────────┐ ┌─────────────┐
│ Playwright │────▶│ Extract embed │────▶│ yt-dlp │
│ (navigate) │ │ video URLs │ │ (download) │
└─────────────────┘ └──────────────────┘ └─────────────┘
Why this approach:
- Playwright handles JavaScript-heavy sites that block simple HTTP requests
- Iframe extraction works around sites that use third-party video hosts
- yt-dlp is the de-facto standard for video downloading with broad host support
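The three stages above can be sketched roughly as follows. This is a minimal illustration, not the code in pokeflix_scraper.py: the function names are hypothetical, the iframe extraction uses the standard library's html.parser on the rendered HTML rather than Playwright's own selectors, and it assumes the embed iframes appear directly in the top-level page.

```python
import subprocess
from html.parser import HTMLParser


class IframeCollector(HTMLParser):
    """Collect the src attribute of every <iframe> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def render_page(url):
    """Stage 1: load the page in Chromium and return its HTML after scripts run."""
    # Imported lazily so the pure-Python helpers work without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html


def extract_embed_urls(html):
    """Stage 2: return iframe sources found in the HTML, in document order."""
    collector = IframeCollector()
    collector.feed(html)
    # Protocol-relative URLs (//host/path) need a scheme before yt-dlp sees them.
    return ["https:" + s if s.startswith("//") else s for s in collector.sources]


def build_download_command(embed_url, output_template):
    """Stage 3: build the yt-dlp invocation for one embed URL."""
    return ["yt-dlp", "-o", output_template, embed_url]


def download_episode(page_url, output_template):
    """Run all three stages for a single episode page (hypothetical helper)."""
    for embed_url in extract_embed_urls(render_page(page_url)):
        subprocess.run(build_download_command(embed_url, output_template), check=True)
```

Keeping the extraction and command-building steps free of browser calls makes them easy to test against saved HTML without launching Chromium.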
Dependencies
# Python packages
pip install playwright yt-dlp
# Playwright browser installation
playwright install chromium
Common Patterns
Anti-Bot Handling
- Use headed browser mode (visible window) initially
- Random delays between requests (2-5 seconds)
- Realistic viewport and user-agent settings
- Wait for the networkidle state after navigation
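A sketch of how these measures might be combined with Playwright's sync API. The viewport size, user-agent string, and helper names are illustrative assumptions, not values taken from the scraper itself.

```python
import random
import time

# Hypothetical values; any realistic desktop viewport and user-agent would do.
CONTEXT_OPTIONS = {
    "viewport": {"width": 1280, "height": 800},
    "user_agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
}


def polite_pause(low=2.0, high=5.0, sleep=time.sleep):
    """Sleep for a random interval between low and high seconds and return it."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


def fetch_pages(urls, headless=False):
    """Visit each URL in one browser context, pausing between requests."""
    # Imported lazily so polite_pause can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    pages_html = {}
    with sync_playwright() as p:
        # headless=False opens a visible window (headed mode).
        browser = p.chromium.launch(headless=headless)
        context = browser.new_context(**CONTEXT_OPTIONS)
        page = context.new_page()
        for url in urls:
            # Wait until network activity settles before reading the page.
            page.goto(url, wait_until="networkidle")
            pages_html[url] = page.content()
            polite_pause()
        browser.close()
    return pages_html
```

The `sleep` parameter on `polite_pause` exists only so the delay logic can be exercised without actually waiting.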
State Management
- JSON state files track downloaded episodes
- Enable the --resume flag to skip completed downloads
- State includes error information for debugging
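One way such a state file could work is sketched below. The JSON layout (an "episodes" mapping with "status" and "error" fields) and the function names are assumptions made for illustration; the actual schema of download_state.json may differ.

```python
import json
from pathlib import Path


def load_state(path):
    """Read the state file, or return an empty state if it does not exist yet."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"episodes": {}}


def save_state(path, state):
    """Write the state to a temporary file, then move it over the real one."""
    path = Path(path)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)


def record_result(state, episode_id, error=None):
    """Mark an episode as done, or as failed with the error message kept."""
    state["episodes"][episode_id] = {
        "status": "failed" if error else "done",
        "error": error,
    }


def pending(state, episode_ids, resume=True):
    """Return the episodes still to download; with resume off, return all."""
    if not resume:
        return list(episode_ids)
    done = {k for k, v in state["episodes"].items() if v["status"] == "done"}
    return [e for e in episode_ids if e not in done]
```

Failed episodes stay in the pending list on the next run, while their stored error message remains available for debugging.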
Output Organization
{output_dir}/
├── {Season Name}/
│ ├── E01 - Episode Title.mp4
│ ├── E02 - Episode Title.mp4
│ └── download_state.json
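The layout above could be produced by a small path helper like this one. The function names and the set of characters stripped from names are assumptions for illustration, not something the scraper is known to do.

```python
import re
from pathlib import Path


def safe_name(text):
    """Drop characters that are invalid in file names on common filesystems."""
    return re.sub(r'[\\/:*?"<>|]', "", text).strip()


def episode_path(output_dir, season_name, number, title, ext="mp4"):
    """Build {output_dir}/{Season Name}/E01 - Episode Title.mp4 style paths."""
    filename = f"E{number:02d} - {safe_name(title)}.{ext}"
    return Path(output_dir) / safe_name(season_name) / filename


def state_path(output_dir, season_name):
    """The state file lives alongside the episodes of its season."""
    return Path(output_dir) / safe_name(season_name) / "download_state.json"
```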
When to Use These Tools
- Downloading entire seasons of shows for offline viewing
- Archiving content before it becomes unavailable
- Building a local media library
Legal Considerations
These tools are for personal archival use. Respect copyright laws in your jurisdiction.