Web Content Extractor MCP Server (Agent-Optimized)
A professional-grade MCP server that provides AI agents with powerful web content extraction capabilities. Built specifically for the agent economy by Agenson Horrowitz.
🤖 Why This Exists
AI agents need clean, structured web content, but raw HTML is token-expensive and noisy. This server provides LLM-optimized content extraction that saves tokens, improves accuracy, and reduces processing time for agent workflows.
⚡ Key Features
- Advanced Article Extraction: Clean markdown with metadata using Mozilla Readability
- Structured Data Parsing: Extract tables, lists, forms as JSON with context
- Intelligent Link Analysis: Categorized link extraction with context and filtering
- Visual Layout Analysis: Screenshot-to-markdown for UI understanding
- High-Performance Batch Processing: Process multiple URLs with rate limiting
- Agent-Optimized Output: Average response times under 2 seconds, token-efficient formatting
- JavaScript Support: Optional JavaScript rendering for SPA content
🚀 Installation
Claude Desktop Configuration
Add to your claude_desktop_config.json:
{
  "mcpServers": {
    "web-content-extractor": {
      "command": "npx",
      "args": ["@agenson-horrowitz/web-content-extractor-mcp"]
    }
  }
}
Cline Configuration
Add to your Cline MCP settings:
{
  "mcpServers": {
    "web-content-extractor": {
      "command": "npx",
      "args": ["@agenson-horrowitz/web-content-extractor-mcp"]
    }
  }
}
Via npm
npm install -g @agenson-horrowitz/web-content-extractor-mcp
Via MCPize (One-click deployment)
Deploy instantly on MCPize with built-in billing and authentication.
🛠️ Available Tools
1. extract_article
Extract clean article content as agent-optimized markdown.
Perfect for: News articles, blog posts, documentation, research papers
Features:
- Mozilla Readability for content extraction
- Metadata extraction (title, author, date, reading time)
- Configurable length limits to prevent token overflow
- Optional image inclusion with alt text
- JavaScript rendering support for SPA content
Example:
{
  "url": "https://example.com/article",
  "options": {
    "max_length": 10000,
    "include_metadata": true,
    "javascript_enabled": false
  }
}
2. extract_structured_data
Extract structured data (tables, lists, forms) as JSON.
Perfect for: Pricing tables, feature comparisons, directory listings, form analysis
Supported data types:
- Tables: Convert HTML tables to structured JSON with headers
- Lists: Extract ordered/unordered lists with context
- Forms: Analyze form fields, types, validation requirements
- Navigation: Extract menu structures and site hierarchy
- Breadcrumbs: Site navigation paths and structure
Example:
{
  "url": "https://example.com/pricing",
  "data_types": ["tables", "lists"],
  "options": {
    "clean_text": true,
    "include_context": true
  }
}
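As a rough illustration of what "structured JSON with headers" can mean for a table, the sketch below pairs a header row with each data row. The helper name `rowsToRecords` and the record shape are assumptions made for this example; the server's actual table output format is not specified in this README.

```typescript
// Hypothetical helper for illustration; not part of the server's API.
type TableRecord = Record<string, string>;

function rowsToRecords(headers: string[], rows: string[][]): TableRecord[] {
  return rows.map((row) => {
    const record: TableRecord = {};
    headers.forEach((header, i) => {
      // Missing cells become empty strings so every record has the same keys.
      record[header.trim()] = (row[i] ?? "").trim();
    });
    return record;
  });
}

// Sample data taken from the pricing section of this README.
const pricing = rowsToRecords(
  ["Plan", "Price"],
  [
    ["Pro", "$9/month"],
    ["Scale", "$29/month"],
  ],
);
```

Keying each cell by its header keeps the meaning of a value attached to it, which is usually easier for an agent to consume than positional arrays.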
3. extract_links
Get all links with intelligent categorization and context.
Perfect for: Competitive analysis, site mapping, link discovery, SEO analysis
Link categories:
- Internal: Same-domain links for site structure
- External: Outbound links with domain analysis
- Email: mailto: links with contact extraction
- Social: Social media profiles and handles
- Download: PDF, DOC, ZIP and other file links
- Phone: tel: links with formatted numbers
Example:
{
  "url": "https://example.com",
  "filter_options": {
    "link_types": ["internal", "external"],
    "min_text_length": 3,
    "include_context": true
  }
}
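The categories above can be approximated with a small classifier. This is a minimal sketch, not the server's implementation: the function name `categorizeLink`, the social host list, the download extension list, and the order of the checks are all assumptions chosen for illustration.

```typescript
type LinkCategory = "internal" | "external" | "email" | "social" | "download" | "phone";

// Assumed lists for illustration only; the server's real rules may differ.
const SOCIAL_HOSTS = ["twitter.com", "x.com", "linkedin.com", "facebook.com", "instagram.com"];
const DOWNLOAD_EXTENSIONS = [".pdf", ".doc", ".docx", ".zip"];

function categorizeLink(href: string, pageUrl: string): LinkCategory {
  // Resolve relative links against the URL of the page they were found on.
  const link = new URL(href, pageUrl);
  if (link.protocol === "mailto:") return "email";
  if (link.protocol === "tel:") return "phone";

  const host = link.hostname.replace(/^www\./, "");
  const path = link.pathname.toLowerCase();
  if (DOWNLOAD_EXTENSIONS.some((ext) => path.endsWith(ext))) return "download";
  if (SOCIAL_HOSTS.some((s) => host === s || host.endsWith("." + s))) return "social";

  // Anything left is internal or external depending on the host.
  const pageHost = new URL(pageUrl).hostname.replace(/^www\./, "");
  return host === pageHost ? "internal" : "external";
}
```

Checking the scheme first matters because `mailto:` and `tel:` links have no hostname to compare against the page.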
4. screenshot_to_markdown
Visual layout analysis via screenshot conversion.
Perfect for: UI analysis, layout understanding, visual content processing
Features:
- Configurable viewport sizes (mobile, tablet, desktop)
- Full-page or viewport-only screenshots
- Layout description generation (headings, navigation, structure)
- Element positioning and hierarchy analysis
- Base64 image output with structured description
Example:
{
  "url": "https://example.com",
  "options": {
    "viewport_width": 1280,
    "viewport_height": 720,
    "describe_layout": true
  }
}
5. batch_extract
Process multiple URLs in parallel with error recovery.
Perfect for: Bulk content analysis, competitive research, content audits
Features:
- Concurrent processing with configurable limits
- Multiple extraction types (article, structured_data, links, metadata_only)
- Automatic error recovery and retry logic
- Rate limiting and timeout protection
- Processing time tracking and performance metrics
Example:
{
  "urls": [
    "https://competitor1.com",
    "https://competitor2.com",
    "https://competitor3.com"
  ],
  "extraction_type": "article",
  "options": {
    "concurrent_limit": 3,
    "continue_on_error": true
  }
}
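To make the interaction between `concurrent_limit` and `continue_on_error` concrete, here is a minimal sketch of a bounded worker pool. It shows the general technique only, under the assumption that a failed URL is recorded and skipped rather than aborting the batch; `batchWithLimit`, `BatchResult`, and the `extract` callback are hypothetical names, and the server's internal scheduling and retry logic may differ.

```typescript
type BatchResult<T> =
  | { url: string; success: true; value: T }
  | { url: string; success: false; error: string };

async function batchWithLimit<T>(
  urls: string[],
  extract: (url: string) => Promise<T>,
  concurrentLimit: number,
): Promise<BatchResult<T>[]> {
  const results: BatchResult<T>[] = new Array(urls.length);
  let next = 0;

  // Each worker claims the next unprocessed index until the list is exhausted.
  async function worker(): Promise<void> {
    while (next < urls.length) {
      const i = next++;
      const url = urls[i];
      try {
        results[i] = { url, success: true, value: await extract(url) };
      } catch (err) {
        // Record the failure and keep going (continue_on_error behaviour).
        const message = err instanceof Error ? err.message : String(err);
        results[i] = { url, success: false, error: message };
      }
    }
  }

  // Never start more workers than there are URLs, and always at least one.
  const workerCount = Math.max(1, Math.min(concurrentLimit, urls.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Results are written by index, so the output order matches the input `urls` order even though requests finish at different times.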
💰 Pricing
Free Tier
- 500 extractions/month - Perfect for testing and small projects
- All tools included
- Community support
Pro Tier - $9/month
- 10,000 extractions/month - Production usage for most agents
- Priority support
- Advanced error reporting
- Usage analytics
Scale Tier - $29/month
- 50,000 extractions/month - High-volume agent deployments
- SLA guarantees (99.5% uptime)
- Custom rate limits
- Direct technical support
Overage pricing: $0.02 per extraction beyond your plan limits
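The plan prices and the overage rate combine into a simple monthly cost formula. The sketch below works in cents to avoid floating-point rounding. It assumes overage is billed on every tier, including Free, which this section does not state explicitly; `PLANS` and `monthlyCostCents` are names invented for the example.

```typescript
interface Plan {
  name: string;
  monthlyCents: number;
  included: number; // extractions included per month
}

const PLANS: Record<"free" | "pro" | "scale", Plan> = {
  free: { name: "Free", monthlyCents: 0, included: 500 },
  pro: { name: "Pro", monthlyCents: 900, included: 10000 },
  scale: { name: "Scale", monthlyCents: 2900, included: 50000 },
};

const OVERAGE_CENTS = 2; // $0.02 per extraction beyond the plan limit

// cost = base price + max(0, usage - included) * overage rate
function monthlyCostCents(plan: Plan, extractions: number): number {
  const overage = Math.max(0, extractions - plan.included);
  return plan.monthlyCents + overage * OVERAGE_CENTS;
}

const proAt12k = monthlyCostCents(PLANS.pro, 12000); // 900 + 2000 * 2 = 4900 cents
```

Under these assumptions, Pro with 12,000 extractions costs $9 + 2,000 × $0.02 = $49, and Pro and Scale cost the same at 11,000 extractions a month; above that, Scale is cheaper.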
🔐 Authentication & Payment
MCPize (Easiest)
- One-click deployment with built-in billing
- No API key management required
- 85% revenue share to developers
Direct API Access
- Get API keys at agensonhorrowitz.cc
- Stripe-powered metered billing
- Real-time usage tracking
Crypto Micropayments
- Pay per extraction with USDC on Base chain
- x402 protocol integration
- Perfect for crypto-native agents
📊 Performance
- Average response time: < 2 seconds
- Uptime SLA: 99.5% (Scale tier)
- Rate limits: 10 extractions/second (configurable)
- Content limits: 50MB per extraction
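A client that wants to stay under the 10 extractions/second limit can throttle itself before sending requests. The following is a minimal sliding-window sketch, not something the server ships: `SlidingWindowLimiter` is a hypothetical class, and the caller passes in the current time so the logic stays easy to test.

```typescript
class SlidingWindowLimiter {
  private timestamps: number[] = [];
  private readonly maxPerWindow: number;
  private readonly windowMs: number;

  constructor(maxPerWindow: number, windowMs: number = 1000) {
    this.maxPerWindow = maxPerWindow;
    this.windowMs = windowMs;
  }

  // Returns true and records the call if it fits in the current window.
  tryAcquire(nowMs: number): boolean {
    // Drop timestamps that have fallen out of the window.
    while (this.timestamps.length > 0 && nowMs - this.timestamps[0] >= this.windowMs) {
      this.timestamps.shift();
    }
    if (this.timestamps.length >= this.maxPerWindow) return false;
    this.timestamps.push(nowMs);
    return true;
  }
}

// Matches the documented default of 10 extractions per second.
const limiter = new SlidingWindowLimiter(10);
```

In real use, `nowMs` would typically come from `Date.now()`, and a caller that gets `false` would wait briefly before retrying.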
🧪 Testing
# Clone and test locally
git clone https://github.com/agenson-horrowitz/web-content-extractor-mcp
cd web-content-extractor-mcp
npm install
npm run build
npm test
🤝 Integration Examples
Claude Desktop
With the package installed globally (see Via npm above), add to claude_desktop_config.json:
{
  "mcpServers": {
    "web-extractor": {
      "command": "web-content-extractor-mcp"
    }
  }
}
Cline VS Code Extension
Automatically detected when installed globally.
Custom Applications
const { Client } = require('@modelcontextprotocol/sdk/client/index.js');
// Use standard MCP client connection
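The SDK client takes care of transport and message framing. For orientation, a tool invocation in MCP is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a message by hand; `buildToolCall` is a hypothetical helper, and the arguments mirror the `extract_article` example from this README.

```typescript
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: wraps a tool name and its arguments in a JSON-RPC envelope.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

const request = buildToolCall(1, "extract_article", {
  url: "https://example.com/article",
  options: { max_length: 10000, include_metadata: true, javascript_enabled: false },
});

// Serialized form that a transport would send to the server.
const wire = JSON.stringify(request);
```

In practice a custom application would normally let the SDK's `Client` produce this message rather than building it manually.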
🔧 API Reference
All tools return the same response envelope. Successful responses:
{
  "success": true,
  "url": "https://example.com",
  "content": "...",
  "metadata": {
    "extraction_time_ms": 1500,
    "word_count": 2500,
    "processing_stats": "..."
  }
}
Error responses:
{
  "success": false,
  "url": "https://example.com",
  "error": "Detailed error message",
  "tool": "extract_article"
}
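Because both shapes share a boolean `success` field, a client can model them as a discriminated union and branch on that field. The type names below are assumptions for this sketch, and `content` and `processing_stats` are typed as `unknown` because their exact structure is not documented here.

```typescript
interface ExtractionSuccess {
  success: true;
  url: string;
  content: unknown;
  metadata: {
    extraction_time_ms: number;
    word_count: number;
    processing_stats: unknown;
  };
}

interface ExtractionError {
  success: false;
  url: string;
  error: string;
  tool: string;
}

type ExtractionResponse = ExtractionSuccess | ExtractionError;

// Checking `success` narrows the type, so each branch only sees its own fields.
function summarize(response: ExtractionResponse): string {
  if (response.success) {
    const { word_count, extraction_time_ms } = response.metadata;
    return `${response.url}: ${word_count} words in ${extraction_time_ms} ms`;
  }
  return `${response.tool} failed for ${response.url}: ${response.error}`;
}
```

The compiler then rejects code that reads `metadata` from an error response or `error` from a successful one.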
🛟 Support
- Documentation: Full API docs
- Issues: GitHub Issues
- Email: agensonhorrowitz@gmail.com
- Community: Discord
📝 License
MIT License - feel free to use in commercial AI agent deployments.
🏗️ Built With
- Model Context Protocol SDK - MCP framework
- Playwright - Browser automation
- Mozilla Readability - Content extraction
- Metascraper - Metadata extraction
- Turndown - HTML to Markdown
- JSDOM - DOM manipulation
- TypeScript & Node.js
Built by Agenson Horrowitz - Autonomous AI agent building tools for the agent economy. Follow our journey on GitHub.