# ShadowCrawl MCP

The Sovereign Stealth Intelligence Engine for AI Agents

Self-hosted stealth scraping and federated search for AI agents: a 100% private, free alternative to Firecrawl, Jina Reader, and Tavily, featuring universal anti-bot bypass, semantic research memory, and copy-paste setup.
ShadowCrawl is built for AI agent workflows that need:
- Reliable multi-source web search
- Fast content extraction and website crawling
- Structured data extraction from messy pages
- Memory-aware research history with semantic recall
- Robust anti-bot & JS-heavy site support (Browserless stealth, automation-marker cleanup, proxy rotation)
- MCP-native usage over stdio and HTTP
If you want something you can run inside your own infra (Docker) and wire directly into Cursor/Claude via MCP, this repo is the “batteries included” baseline.
## Current release status

- Runtime version: `v1.1.0`
- Release validation: stdio agent-mode + HTTP tool-path checks completed for release hardening
- Service health endpoint: `GET /health`
- Tool catalog endpoint: `GET /mcp/tools`
- Tool call endpoint: `POST /mcp/call`
## v1.1.0 highlights

- Shared quality-policy utilities are now centralized and reused across search/scrape/crawl/batch handlers.
- `quality_mode` is available at runtime (`balanced`/`aggressive`) across major extraction tools.
- `proxy_manager` adds `strict_proxy_health` (non-strict diagnostics vs. strict hard-fail behavior).
- `scrape_url`/`scrape_batch` JSON defaults now omit raw HTML noise unless explicitly requested.
- The MCP stdio server path is validated for real agent-mode initialize/list/call flows.
## Tool catalog (v1.0++)

The platform currently exposes 8 MCP tools:

- `search_web` — federated web search
- `search_structured` — search + top-result scraping
- `scrape_url` — single-URL extraction
- `scrape_batch` — multi-URL parallel scraping
- `crawl_website` — bounded recursive crawling
- `extract_structured` — schema-driven extraction
- `research_history` — semantic recall from prior runs
- `proxy_manager` — proxy list/status/switch/test/grab operations
Tip: all tools are available via both transports:

- HTTP: `GET /mcp/tools`, `POST /mcp/call`
- MCP stdio: `shadowcrawl-mcp`
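As a minimal client-side sketch, the HTTP transport can be driven like this. Note the `{"name", "arguments"}` envelope is an assumption based on common MCP bridges, and `build_call`/`call_tool` are illustrative helper names; confirm the exact request schema your build expects via `GET /mcp/tools`.

```python
import json
from urllib import request

# Build a tool-call payload for POST /mcp/call. The {"name", "arguments"}
# envelope is an assumption; verify against your instance's tool catalog.
def build_call(name: str, arguments: dict) -> bytes:
    return json.dumps({"name": name, "arguments": arguments}).encode("utf-8")

def call_tool(base_url: str, name: str, arguments: dict) -> dict:
    # Requires a running ShadowCrawl instance; not invoked in this sketch.
    req = request.Request(
        f"{base_url}/mcp/call",
        data=build_call(name, arguments),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_call("search_web", {"query": "rust async scraping"})
print(payload.decode())
```

With the stack from the quick start below running, `call_tool("http://localhost:5001", "search_web", {"query": "..."})` would issue the same request over HTTP.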
## Architecture

```mermaid
graph TD
    A[MCP Client / Agent] -->|stdio| B[shadowcrawl-mcp]
    A -->|HTTP| C[shadowcrawl HTTP]
    B --> D[MCP Tool Handlers]
    C --> D
    D --> E[tools module - search, scrape, batch, crawl, extract]
    D --> F[features module - history, proxies, antibot]
    E --> G[SearXNG]
    E --> H[Rust scraper]
    H --> K[Browserless optional]
    F --> I[Qdrant optional]
    F --> J[ip.txt proxy list]
```
Code layout (after refactor):

- `mcp-server/src/core` — `AppState`, shared types
- `mcp-server/src/tools` — search/scrape/crawl/extract/batch implementations
- `mcp-server/src/features` — history (Qdrant), proxies, antibot helpers
- `mcp-server/src/nlp` — query rewriting + rerank
- `mcp-server/src/mcp` — HTTP/stdio transports + per-tool handlers + tool catalog
- `mcp-server/src/scraping` — `RustScraper` and its internals
## Anti-bot & JS-heavy site support

If you're evaluating paid scraping stacks, note: ShadowCrawl includes the same practical building blocks, self-hosted and customizable.

- Browserless Chromium (optional, Docker) — JS-heavy rendering with stealth defaults
  - Included in the default stack and configurable via `BROWSERLESS_URL` and `BROWSERLESS_TOKEN`.
  - Supports session tokens, prebooted Chrome, and ad-blocking for more stable runs.
- Stealth fingerprinting and automation-marker cleanup
  - Rotating user-agents, `sec-ch-ua` profiles, and stealth headers tuned for Browserless Chrome.
  - JS cleanup removes common automation markers (including Playwright/Puppeteer markers such as `window.__playwright`) before final extraction.
- Human-like pacing & adaptive delays
  - Jittered request delays (`SCRAPE_DELAY_PRESET`, `SCRAPE_DELAY_MIN_MS`, `SCRAPE_DELAY_MAX_MS`) and a boss-domain post-load delay to mimic human patterns.
- Proxy-driven anti-bot bypass (high-security sites)
  - `proxy_manager` supports `grab`, `list`, `status`, `switch`, and `test` to maintain healthy proxy pools.
  - Supports multiple schemes (`http`, `https`, `socks5`) and per-request proxy selection/rotation.
  - Health checks and automatic switch logic help avoid blocked IPs.
  - For heavily protected targets, combine premium residential/ISP proxies with Browserless rendering and stealth headers.
- Combinable strategies
  - Mix Browserless rendering, stealth headers, pacing, and proxy rotation to maximize success on protected targets.
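A combined setup might look like the environment sketch below. The variable names come from this README; the values (and the `human` preset name) are illustrative assumptions, not shipped defaults.

```shell
# Illustrative anti-bot combination: Browserless rendering + paced
# requests + proxy rotation. Values are examples, not defaults.
export BROWSERLESS_URL="http://browserless:3000"   # optional rendering backend
export BROWSERLESS_TOKEN="changeme"                # token for your Browserless instance
export SCRAPE_DELAY_PRESET="human"                 # assumed preset name; check your build
export SCRAPE_DELAY_MIN_MS=500                     # lower bound for jittered delay
export SCRAPE_DELAY_MAX_MS=2500                    # upper bound for jittered delay
export IP_LIST_PATH="./ip.txt"                     # proxy pool used for rotation
```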
## Scrape Content Quality (Implemented)

Scope: applied across `scrape_url`, `scrape_batch`, `search_structured`, and `crawl_website` outputs.

- HTML-to-Markdown conversion hardening ✅
  - Normalizes `<summary>` to Markdown-style section headings.
  - Unwraps `<details>` containers while preserving inner text.
  - Cleans residual summary/details fragments in post-conversion Markdown.
- Aggressive attribute stripping ✅
  - Strips non-semantic attributes before extraction (`class`, `id`, `style`, `aria-*`, `data-*`, `on*`, `role`, `tabindex`).
  - Keeps semantic signals where needed in extracted results (`href`, `src`, `alt`, `title`).
  - Reduces UI/noise tokens that degrade LLM comprehension.
- Media handling in content preview ✅
  - Keeps OG image metadata and now injects Markdown image context with fallback labels.
  - Exposes image hints directly in the `scrape_url` text preview for better multimodal grounding.
## Noise-Reduction Strategy (How we stay competitive)
- Layered extraction pipeline: pre-clean HTML → readability/heuristics → fallback text-only extraction.
- Boilerplate suppression: removes nav/footer/forms/ads and common noisy blocks before markdown conversion.
- Semantic-first output: prioritizes main content, headings, code blocks, and canonical links over decorative markup.
- Anti-block resilience: combines stealth headers, adaptive delays, Browserless rendering, and proxy rotation.
- Quality gates: warnings and extraction score help detect weak/blocked content and trigger retries/fallbacks.
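The quality-gate idea can be sketched as a simple retry predicate. This is a hypothetical illustration, not ShadowCrawl's internal logic: the `extraction_score`/`warnings` field names and the 0.4 threshold are assumptions.

```python
# Hypothetical quality gate: retry a scrape when the extraction score is
# weak or warnings suggest a block/CAPTCHA. Field names and the threshold
# are illustrative assumptions, not ShadowCrawl's actual API.
def should_retry(result: dict, min_score: float = 0.4) -> bool:
    score = result.get("extraction_score", 0.0)
    warnings = result.get("warnings", [])
    blocked = any("block" in w.lower() or "captcha" in w.lower() for w in warnings)
    return blocked or score < min_score

weak = {"extraction_score": 0.2, "warnings": ["content looks like a CAPTCHA page"]}
good = {"extraction_score": 0.9, "warnings": []}
print(should_retry(weak), should_retry(good))
```

A caller would route `True` results to a fallback path (e.g. Browserless rendering or a fresh proxy) before giving up.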
## Quick start (Docker)

Start the stack:

```shell
docker compose -f docker-compose-local.yml up -d --build
```

Check health:

```shell
curl -s http://localhost:5001/health
```

Check the tool surface:

```shell
curl -s http://localhost:5001/mcp/tools
```
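To sanity-check the tool surface programmatically, you might extract the tool names from the catalog response. The `{"tools": [{"name": ...}]}` shape is an assumption about the `/mcp/tools` payload; the `sample` below is stand-in data, not real output.

```python
import json

# Extract tool names from a /mcp/tools response. The {"tools": [...]}
# shape is an assumption; inspect the real response from your instance.
sample = json.dumps({
    "tools": [
        {"name": "search_web", "description": "federated web search"},
        {"name": "scrape_url", "description": "single URL extraction"},
    ]
})

names = [tool["name"] for tool in json.loads(sample)["tools"]]
print(names)
```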
## Quick start (local Rust)

Run the HTTP server:

```shell
cd mcp-server
cargo run --release
```

Run the MCP stdio server:

```shell
cd mcp-server
cargo run --release --bin shadowcrawl-mcp
```
## Proxy configuration (ip.txt + proxy_source.json)

This project uses `ip.txt` as the primary proxy list.

- `ip.txt` (one proxy per line). Examples:
  - `http://1.2.3.4:8080`
  - `https://1.2.3.4:8443`
  - `socks5://1.2.3.4:1080`
- `proxy_source.json` (public sources that `proxy_manager` can fetch from)

With `docker-compose-local.yml`, the defaults are already wired:

- `IP_LIST_PATH=/home/appuser/ip.txt`
- `PROXY_SOURCE_PATH=/home/appuser/proxy_source.json`

And mounted from your repo root:

- `./ip.txt:/home/appuser/ip.txt`
- `./proxy_source.json:/home/appuser/proxy_source.json`
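Since `ip.txt` entries follow the `scheme://host:port` shape shown above, they can be validated with the standard library. This is a small client-side sketch, not the parser ShadowCrawl uses internally.

```python
from urllib.parse import urlsplit

# Split ip.txt entries (scheme://host:port, one per line) into parts,
# skipping blanks and comment lines. Validation sketch only.
def parse_proxies(text: str) -> list[tuple[str, str, int]]:
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = urlsplit(line)
        if parts.scheme in ("http", "https", "socks5") and parts.port:
            out.append((parts.scheme, parts.hostname, parts.port))
    return out

print(parse_proxies("http://1.2.3.4:8080\nsocks5://1.2.3.4:1080\n"))
```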
## Using the proxy_manager tool

The `proxy_manager` MCP tool supports these actions:

- `grab` — fetch proxy lists from sources in `proxy_source.json`
- `list` — list proxies currently in `ip.txt`
- `status` — proxy manager stats (requires `IP_LIST_PATH` to exist)
- `switch` — select the best proxy
- `test` — test a proxy against a target URL
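Over the HTTP transport, a `test` action might be invoked with a payload like the one below. The `{"name", "arguments"}` envelope and the `action`/`url` argument names are assumptions; check `GET /mcp/tools` for the actual input schema.

```python
import json

# Hypothetical proxy_manager invocation body for POST /mcp/call.
# Envelope and argument names are assumptions, not a documented schema.
payload = json.dumps({
    "name": "proxy_manager",
    "arguments": {"action": "test", "url": "https://example.com"},
})
print(payload)
```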
## MCP client configuration (stdio)

Use this in your VS Code/Cursor `mcp.json`:

```json
{
  "servers": {
    "shadowcrawl": {
      "command": "docker",
      "args": [
        "compose",
        "-f",
        "/absolute/path/to/shadowcrawl/docker-compose-local.yml",
        "exec",
        "-i",
        "-T",
        "shadowcrawl",
        "shadowcrawl-mcp"
      ],
      "type": "stdio"
    }
  }
}
```
## Sample results
Real tool outputs (copy/paste examples):
- sample-results/search_web.txt
- sample-results/search_structured_json.txt
- sample-results/scrape_url.txt
- sample-results/scrape_url_json.txt
- sample-results/scrape_batch_json.txt
- sample-results/crawl_website_json.txt
- sample-results/extract_structured_json.txt
- sample-results/research_history_json.txt
- sample-results/proxy_manager_json.txt
## What changed from v0.3.x to v1.0++

- Introduced a production MCP surface with unified HTTP + stdio tool behavior
- Added and validated the `proxy_manager` operational workflow
- Expanded tooling to the full 8-tool set with end-to-end readiness checks
- Improved server lifecycle handling for stdio MCP reliability
- Centralized tool schema/catalog definitions to reduce drift
- Added release validation artifacts and readiness reporting

Detailed notes: docs/RELEASE_NOTES_v1.0.0.md
## Key environment variables

| Variable | Purpose | Default |
|---|---|---|
| `SEARXNG_URL` | search backend URL | `http://searxng:8080` |
| `QDRANT_URL` | semantic memory backend | unset |
| `BROWSERLESS_URL` | browser rendering backend | unset |
| `HTTP_TIMEOUT_SECS` | outbound timeout | 30 |
| `HTTP_CONNECT_TIMEOUT_SECS` | connect timeout | 10 |
| `OUTBOUND_LIMIT` | concurrency limiter | 32 |
| `IP_LIST_PATH` | proxy IP list path | `ip.txt` |
| `PROXY_SOURCE_PATH` | proxy source list path | `proxy_source.json` |

Proxy config template: `docs/examples/proxy_source.example.json`

NOTE: `ip.txt` is tracked in the repo root and is the canonical proxy list.
## Comparison (quick decision guide)
This project is meant to be self-hosted infrastructure. A rough mental model:
| If you use… | You may prefer ShadowCrawl when… |
|---|---|
| Firecrawl / other hosted scraping APIs | You want local control (cost, privacy, networking), MCP-native integration, and can run Docker. |
| Jina Reader / “reader mode” services | You need more than reader conversion: crawling, batch mode, structured extraction, and a single MCP tool surface. |
| Browserless (alone) | You want Browserless as an optional backend, but with a full tool suite (search/crawl/extract/proxy/history) around it. |
| Bright Data / proxy networks | You already have proxy sources, and want a Rust/MCP orchestration layer + rotation/health logic on top (this repo does not ship a proxy network). |
## Production checklist

- Build passes (`cargo check`)
- Release build passes (`cargo build --release`)
- MCP tool surface is discoverable (`/mcp/tools`)
- Core tools return expected payloads
- Proxy manager commands operational
- Health endpoint stable
- Release validation artifact generated

Operational checklist document: docs/GA_REFACTOR_READINESS_2026-02-12.md
## Repository map
- mcp-server/src — Rust core server and MCP handlers
- docs — release reports, architecture notes, operational guidance
- docs/IDE_SETUP.md — MCP client setup for popular IDEs/apps
- docs/SEARXNG_TUNING.md — tuning SearXNG for noise / bans
- searxng — SearXNG runtime configuration
- sample-results — sample outputs
## 🙏 Acknowledgments & Support
Built with ❤️ by a Solo Developer for the open-source community.
I'm actively maintaining this project to provide the best free search & scraping infrastructure for AI agents.
- Found a bug? I'm happy to fix it! Please Open an Issue.
- Want a new feature? Feature requests are welcome! Let me know what you need.
- Love the project? Star the repo ⭐ or buy me a coffee to support development!
Special thanks to:
- SearXNG Project for the incredible privacy-respecting search infrastructure.
- Qdrant for the vector search engine.
- Rust Community for the amazing tooling.
## License
MIT License. Free to use for personal and commercial projects.