
# 🎬 YTPipe - AI-Native YouTube Processing Pipeline

Transform YouTube videos into LLM-ready knowledge bases with a production-ready MCP backend.
## ✨ Features
- 🤖 MCP Integration - 12 AI-callable tools for seamless agent integration
- 🎯 Smart Chunking - Semantic text chunking with timeline timestamps
- 🧠 Vector Embeddings - 384-dimensional embeddings for semantic search
- 🔍 Full-Text Search - Context-aware transcript search
- 📊 SEO Intelligence - AI-powered title, tag, and description optimization
- ⏱️ Timeline Analysis - Topic evolution and keyword density tracking
- 🏗️ Microservices - 11 independent, composable services
- 🔐 Type-Safe - Pydantic models throughout
- ⚡ Async-First - Non-blocking I/O operations
- 🗄️ Multi-Backend - ChromaDB, FAISS, Qdrant support
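The smart-chunking feature above can be sketched roughly as follows. This is a minimal illustration, not YTPipe's actual implementation; the `(start, end, text)` segment format mirrors Whisper-style transcript output, and `max_chars` is an illustrative parameter:

```python
# Minimal sketch of timestamp-aware chunking: group transcript segments
# into chunks of roughly `max_chars` characters, keeping each chunk's
# start/end times so results can be mapped back onto the video timeline.
def chunk_segments(segments, max_chars=200):
    chunks, buf, start = [], [], None
    for seg_start, seg_end, text in segments:
        if start is None:
            start = seg_start
        buf.append(text)
        if sum(len(t) for t in buf) >= max_chars:
            chunks.append({"start": start, "end": seg_end, "text": " ".join(buf)})
            buf, start = [], None
    if buf:  # flush any trailing partial chunk
        chunks.append({"start": start, "end": segments[-1][1], "text": " ".join(buf)})
    return chunks

segments = [
    (0.0, 4.2, "Welcome to the channel."),
    (4.2, 9.8, "Today we cover vector databases."),
    (9.8, 15.0, "First, what is an embedding?"),
]
chunks = chunk_segments(segments, max_chars=40)
```

Each chunk carries its own time range, which is what makes "jump to the part of the video that answers my question" queries possible.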
## 🚀 Quick Start

```bash
# Install
git clone https://github.com/leolech14/ytpipe.git
cd ytpipe
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Process a video
ytpipe "https://youtube.com/watch?v=dQw4w9WgXcQ"
```

Result: Metadata + Transcript + Semantic Chunks + Embeddings + Vector Storage
## 🎯 Usage Examples

### MCP Server (AI Agents)

```bash
python -m ytpipe.mcp.server
```

Then from Claude Code:

> "Process this video: https://youtube.com/watch?v=VIDEO_ID"
> "Search video dQw4w9WgXcQ for 'machine learning'"
> "Optimize SEO for video dQw4w9WgXcQ"

### CLI (Humans)

```bash
# Basic
ytpipe "https://youtube.com/watch?v=VIDEO_ID"

# Advanced
ytpipe URL --backend faiss --whisper-model large --verbose
```
### Python API (Developers)

```python
import asyncio
from ytpipe.core.pipeline import Pipeline

async def main():
    pipeline = Pipeline(output_dir="./output")
    result = await pipeline.process("https://youtube.com/watch?v=VIDEO_ID")
    print(f"✅ {result.metadata.title}")
    print(f"   Chunks: {len(result.chunks)}")
    print(f"   Time: {result.processing_time:.1f}s")

asyncio.run(main())
```
## 📋 MCP Tools

### Pipeline (4 tools)

- `ytpipe_process_video` - Full pipeline
- `ytpipe_download` - Download only
- `ytpipe_transcribe` - Transcribe audio
- `ytpipe_embed` - Generate embeddings

### Query (4 tools)

- `ytpipe_search` - Full-text search
- `ytpipe_find_similar` - Semantic search
- `ytpipe_get_chunk` - Get chunk by ID
- `ytpipe_get_metadata` - Get video info
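At its core, `ytpipe_find_similar`-style semantic search is a nearest-neighbour lookup over embedding vectors. A minimal cosine-similarity sketch (toy 4-dimensional vectors stand in for the real 384-dimensional ones; chunk IDs are illustrative):

```python
import math

# Cosine similarity: the standard relevance measure for embedding search.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "chunk embeddings" (real ones are 384-dimensional).
index = {
    "chunk-1": [1.0, 0.0, 0.0, 0.0],
    "chunk-2": [0.9, 0.1, 0.0, 0.0],
    "chunk-3": [0.0, 0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0, 0.0]

# Rank chunks by similarity to the query embedding, most similar first.
ranked = sorted(index, key=lambda cid: cosine(query, index[cid]), reverse=True)
```

In practice the vector backend (ChromaDB, FAISS, or Qdrant) performs this ranking with an approximate index instead of a brute-force scan.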
### Analytics (4 tools)

- `ytpipe_seo_optimize` - SEO recommendations
- `ytpipe_quality_report` - Quality metrics
- `ytpipe_topic_timeline` - Topic evolution
- `ytpipe_benchmark` - Performance analysis
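From a raw MCP client's perspective, each tool above is invoked with a standard `tools/call` JSON-RPC 2.0 request, as defined by the Model Context Protocol. A sketch of the payload shape (the `video_id`/`query` argument names are illustrative, not YTPipe's documented schema):

```python
import json

# JSON-RPC 2.0 request an MCP client would send to invoke a tool.
# Argument names inside "arguments" are illustrative placeholders.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "ytpipe_search",
        "arguments": {"video_id": "dQw4w9WgXcQ", "query": "machine learning"},
    },
}
payload = json.dumps(request)
```

AI agents never build this by hand; frameworks like Claude Code generate the request from the tool's declared schema.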
## 🏗️ Architecture

MCP Server (12 tools) → Pipeline Orchestrator → 11 Services → Pydantic Models

**Services:**

- Extractors (2): Download, Transcriber
- Processors (4): Chunker, Embedder, VectorStore, Docling
- Intelligence (4): Search, SEO, Timeline, Analyzer
- Exporters (1): Dashboard

**8 Processing Phases:**

1. Download → 2. Transcription → 3. Chunking → 4. Embeddings → 5. Export → 6. Dashboard → 7. Docling → 8. Vector Storage
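The eight phases run as a sequential async pipeline over a shared context. A minimal orchestration sketch (the phase names come from the list above; everything else is illustrative, not YTPipe's actual orchestrator):

```python
import asyncio

# Each phase is an async step that receives and enriches a shared context.
async def run_phase(name, ctx):
    ctx[name] = "done"  # stand-in for the real work of that phase
    return ctx

PHASES = ["download", "transcription", "chunking", "embeddings",
          "export", "dashboard", "docling", "vector_storage"]

async def run_pipeline(url):
    ctx = {"url": url}
    for phase in PHASES:  # phases run in order; each is awaited before the next
        ctx = await run_phase(phase, ctx)
    return ctx

result = asyncio.run(run_pipeline("https://youtube.com/watch?v=VIDEO_ID"))
```

Keeping phases as independent awaited steps is what lets the MCP tools expose them individually (`ytpipe_download`, `ytpipe_transcribe`, ...) as well as end-to-end (`ytpipe_process_video`).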
## 📊 Performance
| Metric | Value |
|---|---|
| Processing Speed | 4-13x real-time |
| Memory Usage | <2GB peak |
| Chunk Quality | 85%+ high quality |
| Embedding Dimension | 384 |
## 🔧 Requirements
- Python 3.8+
- FFmpeg (for audio extraction)
- 4GB+ RAM recommended
- GPU optional (CUDA for acceleration)
## 📖 Documentation
## 🤝 Contributing
Contributions welcome! Please read CONTRIBUTING.md first.
## 📝 License
MIT License - see LICENSE for details.
## 🙏 Credits
Built with:
- FastMCP - MCP server framework
- OpenAI Whisper - Speech-to-text
- sentence-transformers - Text embeddings
- Model Context Protocol - AI tool standard
## 📧 Contact
Leonardo Lech
- Email: leonardo.lech@gmail.com
- GitHub: @leolech14
⭐ **Star this repo if you find it useful!**

*Transform YouTube → Knowledge Base in seconds*