---
title: "Media Tools Overview"
description: "Directory overview for media downloading tools using Playwright browser automation and yt-dlp, covering architecture patterns, anti-bot handling, and state management."
type: context
domain: media-tools
tags: [yt-dlp, playwright, web-scraping, video-download, browser-automation]
---

# Media Tools

Tools for downloading and managing media from streaming sites.

## Overview

This directory contains utilities for:
- Extracting video URLs from streaming sites using browser automation
- Downloading videos via yt-dlp
- Managing download state for resumable operations

## Tools

### pokeflix_scraper.py
Downloads Pokemon episodes from pokeflix.tv using Playwright for browser automation.

**Location:** `scripts/pokeflix_scraper.py`

**Features:**
- Extracts episode lists from season pages
- Handles iframe-embedded video players (Streamtape, Vidoza, etc.)
- Resumable downloads with state persistence
- Configurable episode ranges
- Dry-run mode for testing
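As a rough illustration of how episode ranges and dry-run mode could be exposed on the command line, here is a minimal `argparse` sketch. Only `--resume` is named in this document; the positional `season_url` and the `--start`, `--end`, and `--dry-run` flags are hypothetical names chosen for the example and may not match the real script.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI; only --resume is mentioned in this document."""
    parser = argparse.ArgumentParser(description="Season downloader sketch")
    parser.add_argument("season_url", help="URL of the season page (assumed argument)")
    parser.add_argument("--start", type=int, default=1,
                        help="first episode to fetch (assumed flag)")
    parser.add_argument("--end", type=int, default=None,
                        help="last episode to fetch, inclusive (assumed flag)")
    parser.add_argument("--dry-run", action="store_true",
                        help="list episodes without downloading (assumed flag)")
    parser.add_argument("--resume", action="store_true",
                        help="skip episodes already recorded as completed")
    return parser


def select_episodes(episodes: List[int], start: int, end: Optional[int]) -> List[int]:
    """Keep episode numbers inside the inclusive range [start, end]."""
    return [n for n in episodes if n >= start and (end is None or n <= end)]
```

In a dry run, a script built this way would print the result of `select_episodes(...)` and exit before any download step.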

## Architecture Pattern

These tools follow a common pattern:

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────┐
│   Playwright    │────▶│  Extract embed   │────▶│   yt-dlp    │
│   (navigate)    │     │   video URLs     │     │  (download) │
└─────────────────┘     └──────────────────┘     └─────────────┘
```

**Why this approach:**
1. **Playwright** handles JavaScript-heavy sites that block simple HTTP requests
2. **Iframe extraction** locates the real video URL when a site embeds its player from a third-party host
3. **yt-dlp** is the de facto standard for video downloading with broad host support
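The three stages can be sketched roughly as follows. This is not the code in `pokeflix_scraper.py`: the function names are hypothetical, the iframe parsing uses the standard library's `HTMLParser` as a stand-in for whatever the script really does, and the output template is an assumption.

```python
import subprocess
from html.parser import HTMLParser
from pathlib import Path


class IframeCollector(HTMLParser):
    """Collects the src attribute of every <iframe> in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def fetch_rendered_html(url: str) -> str:
    """Stage 1: let Playwright render the page, JavaScript included."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html


def extract_embed_urls(html: str) -> list:
    """Stage 2: pull the embedded player URLs out of the rendered HTML."""
    collector = IframeCollector()
    collector.feed(html)
    return collector.sources


def build_download_command(embed_url: str, out_dir: Path) -> list:
    """Stage 3: build a yt-dlp invocation; -o sets the output template."""
    return ["yt-dlp", "-o", str(out_dir / "%(title)s.%(ext)s"), embed_url]


def download_video(embed_url: str, out_dir: Path) -> None:
    """Run yt-dlp and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_download_command(embed_url, out_dir), check=True)
```

Keeping the URL extraction separate from the browser and download steps means it can be tested against saved HTML without launching Chromium or touching the network.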

## Dependencies

```bash
# Python packages
pip install playwright yt-dlp

# Playwright browser installation
playwright install chromium
```

## Common Patterns

### Anti-Bot Handling
- Start in headed browser mode (visible window)
- Add random delays between requests (2-5 seconds)
- Set a realistic viewport and user agent
- Wait for the `networkidle` state after navigation
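A minimal sketch of how these four settings might be combined with Playwright's sync API. The viewport size, the user-agent string, and the helper names `polite_delay` and `fetch_pages` are illustrative assumptions, not values taken from the actual scripts.

```python
import random
import time

# Assumed values for illustration; tune them for the target site.
VIEWPORT = {"width": 1280, "height": 720}
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def polite_delay(low: float = 2.0, high: float = 5.0) -> float:
    """Sleep for a random interval in [low, high] seconds and return it."""
    seconds = random.uniform(low, high)
    time.sleep(seconds)
    return seconds


def fetch_pages(urls, headless: bool = False):
    """Visit each URL in one browser context and return the rendered HTML."""
    from playwright.sync_api import sync_playwright  # imported lazily

    rendered = []
    with sync_playwright() as p:
        # headless=False opens a visible window (headed mode).
        browser = p.chromium.launch(headless=headless)
        context = browser.new_context(viewport=VIEWPORT, user_agent=USER_AGENT)
        page = context.new_page()
        for url in urls:
            # Wait until network activity has settled before reading the DOM.
            page.goto(url, wait_until="networkidle")
            rendered.append(page.content())
            polite_delay()
        browser.close()
    return rendered
```

Once a site is known to work, `headless=True` can be passed to run without a visible window.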

### State Management
- JSON state files track downloaded episodes
- Pass the `--resume` flag to skip completed downloads
- State includes error information for debugging
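One way such a state file could work is sketched below. The schema (a `completed` list plus an `errors` mapping) and all function names are assumptions made for the example; the real `download_state.json` layout is not documented here.

```python
import json
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read the state file, or start fresh if it does not exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"completed": [], "errors": {}}


def save_state(path: Path, state: dict) -> None:
    """Write to a temporary file first so a crash cannot truncate the state."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)


def record_result(state: dict, episode: str, error: str = None) -> None:
    """Mark an episode as completed, or store its error message."""
    if error is None:
        if episode not in state["completed"]:
            state["completed"].append(episode)
        state["errors"].pop(episode, None)
    else:
        state["errors"][episode] = error


def pending(episodes: list, state: dict, resume: bool) -> list:
    """With resume enabled, drop episodes already marked as completed."""
    if not resume:
        return list(episodes)
    done = set(state["completed"])
    return [e for e in episodes if e not in done]
```

Saving after every episode, rather than once at the end, is what makes an interrupted run resumable.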

### Output Organization
```
{output_dir}/
├── {Season Name}/
│   ├── E01 - Episode Title.mp4
│   ├── E02 - Episode Title.mp4
│   └── download_state.json
```
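The layout above could be produced by a small path helper like the following sketch. The names `safe_name` and `episode_path` are hypothetical, and the set of characters replaced is an assumption based on what common filesystems reject.

```python
import re
from pathlib import Path


def safe_name(text: str) -> str:
    """Replace characters that common filesystems reject in file names."""
    return re.sub(r'[<>:"/\\|?*]', "_", text).strip()


def episode_path(output_dir: Path, season: str, number: int, title: str) -> Path:
    """Build {output_dir}/{Season Name}/E01 - Episode Title.mp4."""
    filename = f"E{number:02d} - {safe_name(title)}.mp4"
    return Path(output_dir) / safe_name(season) / filename
```

Zero-padding the episode number keeps files in broadcast order when sorted by name.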

## When to Use These Tools

- Downloading entire seasons of shows for offline viewing
- Archiving content before it becomes unavailable
- Building a local media library

## Legal Considerations

These tools are for personal archival use. Respect copyright laws in your jurisdiction.