AI-Driven Universal Web Data Extraction Platform
A production-grade, MCP-enabled universal web scraping platform with MongoDB storage and advanced anti-bot (antigravity) mechanisms.
🎯 Features
- Dual Scraping Engines: Static (Requests + BeautifulSoup) and Dynamic (Playwright)
- Auto-Detection: Automatically selects the appropriate scraper based on page content
- Anti-Bot Protection: User-Agent rotation, rate limiting, robots.txt compliance, stealth mode
- MongoDB Storage: Persists all scraped data with full metadata
- MCP Integration: Exposes scraping as tools for LLM invocation
- Export Options: JSON and CSV export capabilities
📁 Project Structure
```
d:\mcp\
├── requirements.txt          # Python dependencies
├── config.py                 # Configuration settings
├── main.py                   # FastAPI MCP server entry point
├── scraper/
│   ├── static_scraper.py     # Requests + BeautifulSoup scraper
│   ├── dynamic_scraper.py    # Playwright scraper
│   └── strategy_selector.py  # Auto-detection logic
├── antigravity/
│   ├── user_agents.py        # User-Agent rotation
│   ├── throttle.py           # Request delays & rate limiting
│   ├── robots_validator.py   # robots.txt compliance
│   └── stealth.py            # Playwright stealth configuration
├── database/
│   ├── mongodb.py            # MongoDB connection & operations
│   └── models.py             # Pydantic data models
├── mcp/
│   └── tools.py              # MCP tool definitions
├── utils/
│   ├── normalizer.py         # Data normalization
│   └── exporter.py           # CSV/JSON export
├── tests/                    # Test suite
└── docs/
    └── README.md             # This file
```
🚀 Quick Start
1. Install Dependencies
```shell
cd d:\mcp
pip install -r requirements.txt
playwright install chromium
```
2. Start MongoDB
Ensure MongoDB is running on localhost:27017 (or update MONGODB_URI in config.py).
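A minimal sketch of how `config.py` might expose that setting (the `MONGODB_URI` name comes from this README; reading it from the environment with a localhost fallback is an assumption, not the project's actual code):

```python
import os

# Fall back to the default local MongoDB instance unless the
# environment overrides it (e.g. for Docker or a remote cluster).
MONGODB_URI = os.environ.get("MONGODB_URI", "mongodb://localhost:27017")
```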
3. Run the Server
```shell
python main.py
```
The server will start at http://localhost:8000.
4. Test the API
Open http://localhost:8000/docs for interactive Swagger documentation.
🔌 API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /scrape | POST/GET | Scrape a website |
| /stats | GET | Get scraping statistics |
| /recent | GET | Get recently scraped data |
| /logs | GET | Get scrape logs |
| /export/json | POST | Export data to JSON |
| /export/csv | POST | Export data to CSV |
| /health | GET | Health check |
Example Scrape Request
```shell
curl -X POST "http://localhost:8000/scrape" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "auto_detect": true}'
```
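The same request can be issued from Python with only the standard library (the helper name `build_scrape_request` is illustrative, not part of the platform):

```python
import json
from urllib.request import Request

def build_scrape_request(url: str, auto_detect: bool = True) -> Request:
    """Assemble the POST request for the /scrape endpoint."""
    payload = json.dumps({"url": url, "auto_detect": auto_detect}).encode()
    return Request(
        "http://localhost:8000/scrape",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Once the server is running, send it with:
#   from urllib.request import urlopen
#   with urlopen(build_scrape_request("https://example.com")) as resp:
#       print(json.load(resp))
```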
🧠 MCP Tool Usage
The platform exposes a scrape_website tool via MCP:
```json
{
  "name": "scrape_website",
  "parameters": {
    "url": "string (required)",
    "dynamic": "boolean (default: false)",
    "auto_detect": "boolean (default: true)",
    "store_in_mongodb": "boolean (default: true)"
  }
}
```
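A sketch of how a client might assemble a call to this tool, applying the schema defaults (`build_tool_call` is a hypothetical helper, not part of the platform):

```python
def build_tool_call(url: str, *, dynamic: bool = False,
                    auto_detect: bool = True,
                    store_in_mongodb: bool = True) -> dict:
    """Fill in scrape_website arguments, mirroring the schema defaults above."""
    return {
        "name": "scrape_website",
        "arguments": {
            "url": url,
            "dynamic": dynamic,
            "auto_detect": auto_detect,
            "store_in_mongodb": store_in_mongodb,
        },
    }
```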
🛡️ Anti-Bot (Antigravity) Features
- User-Agent Rotation: 20+ realistic browser User-Agents
- Request Throttling: 1-5 second random delays between requests
- Rate Limiting: Max 10 requests per domain per minute
- robots.txt Compliance: Respects crawling restrictions
- Playwright Stealth Mode: Disables automation detection flags
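The rotation and rate-limiting pieces can be sketched with the standard library (names here are illustrative; the real implementations live in antigravity/user_agents.py and antigravity/throttle.py, and the User-Agent pool is trimmed to two entries):

```python
import random
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

# Trimmed pool; the real user_agents.py ships 20+ entries.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
]

def random_user_agent() -> str:
    return random.choice(USER_AGENTS)

class DomainRateLimiter:
    """Sliding-window limiter: at most max_per_minute requests per domain."""

    def __init__(self, max_per_minute: int = 10):
        self.max_per_minute = max_per_minute
        self._hits = defaultdict(deque)  # domain -> request timestamps

    def allow(self, url: str, now=None) -> bool:
        if now is None:
            now = time.monotonic()
        window = self._hits[urlparse(url).netloc]
        while window and now - window[0] >= 60:  # drop entries older than a minute
            window.popleft()
        if len(window) >= self.max_per_minute:
            return False
        window.append(now)
        return True
```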
📊 MongoDB Schema
scraped_data Collection
```json
{
  "_id": "ObjectId",
  "url": "string",
  "scraped_at": "ISO timestamp",
  "scraper_type": "static | dynamic",
  "content": {
    "title": "string",
    "text": "string",
    "links": ["string"]
  },
  "metadata": {
    "status_code": "number",
    "response_time": "number",
    "user_agent": "string"
  }
}
```
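A sketch of the corresponding document model, shown with stdlib dataclasses for brevity and with the metadata fields flattened (the project's database/models.py uses Pydantic; field names follow the schema above):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ScrapedContent:
    title: str
    text: str
    links: list = field(default_factory=list)

@dataclass
class ScrapedDocument:
    url: str
    scraper_type: str        # "static" or "dynamic"
    content: ScrapedContent
    status_code: int
    response_time: float
    user_agent: str
    scraped_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```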
scrape_logs Collection
```json
{
  "url": "string",
  "timestamp": "ISO timestamp",
  "success": "boolean",
  "error": "string | null"
}
```
🧪 Running Tests
```shell
cd d:\mcp
pytest tests/ -v
```
⚖️ Ethical Considerations
- Always respects robots.txt directives
- Implements polite crawling with delays
- Only scrapes publicly accessible content
- Rate limiting prevents server overload
- Designed for responsible use
📋 Limitations
- Cannot bypass authentication or CAPTCHAs
- JavaScript-heavy SPAs may require dynamic scraping
- Some sites may detect and block scraping despite stealth measures
- Rate limiting may slow down bulk operations
🔮 Future Scope
- Proxy rotation support
- CAPTCHA solving integration
- Distributed scraping with task queues
- Advanced content extraction (structured data, tables)
- Scheduled/recurring scrapes
- WebSocket real-time updates
📄 License
This project is for educational purposes.