# ShadowCrawl MCP

The Sovereign Stealth Intelligence Engine for AI Agents

Self-hosted stealth scraping and federated search for AI agents: a 100% private, free alternative to Firecrawl, Jina Reader, and Tavily, featuring universal anti-bot bypass, semantic research memory, and copy-paste setup.
ShadowCrawl is built for AI agent workflows that need:
- Reliable multi-source web search
- Fast content extraction and website crawling
- Structured data extraction from messy pages
- Memory-aware research history with semantic recall
- Robust anti-bot & JS-heavy site support (Browserless stealth, automation-marker cleanup, proxy rotation)
- MCP-native usage over stdio and HTTP
If you want something you can run inside your own infra (Docker) and wire directly into Cursor/Claude via MCP, this repo is the “batteries included” baseline.
## Current release status

- Runtime version: `v1.1.0`
- Release validation: stdio agent-mode + HTTP tool-path checks completed for release hardening
- Service health endpoint: `GET /health`
- Tool catalog endpoint: `GET /mcp/tools`
- Tool call endpoint: `POST /mcp/call`
## v1.1.0 highlights

- Shared quality-policy utilities are now centralized and reused across search/scrape/crawl/batch handlers.
- `quality_mode` is available at runtime (`balanced`/`aggressive`) across major extraction tools.
- `proxy_manager` adds `strict_proxy_health` (non-strict diagnostics vs. strict hard-fail behavior).
- `scrape_url`/`scrape_batch` JSON defaults now omit raw HTML noise unless explicitly requested.
- The MCP stdio server path is validated for real agent-mode initialize/list/call flows.
## Tool catalog (v1.0++)

The platform currently exposes 8 MCP tools:

- `search_web` — federated web search
- `search_structured` — search + top-result scraping
- `scrape_url` — single-URL extraction
- `scrape_batch` — multi-URL parallel scraping
- `crawl_website` — bounded recursive crawling
- `extract_structured` — schema-driven extraction
- `research_history` — semantic recall from prior runs
- `proxy_manager` — proxy list/status/switch/test/grab operations
Tip: all tools are available via both transports:

- HTTP: `GET /mcp/tools`, `POST /mcp/call`
- MCP stdio: `shadowcrawl-mcp`
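As a minimal client-side sketch, the HTTP transport can be driven like this. Note the `{"name", "arguments"}` envelope is an assumption based on common MCP bridges, and `build_call`/`call_tool` are illustrative helper names; confirm the exact request schema your build expects via `GET /mcp/tools`.

```python
import json
from urllib import request

# Build a tool-call payload for POST /mcp/call. The {"name", "arguments"}
# envelope is an assumption; verify against your instance's tool catalog.
def build_call(name: str, arguments: dict) -> bytes:
    return json.dumps({"name": name, "arguments": arguments}).encode("utf-8")

def call_tool(base_url: str, name: str, arguments: dict) -> dict:
    # Requires a running ShadowCrawl instance; not invoked in this sketch.
    req = request.Request(
        f"{base_url}/mcp/call",
        data=build_call(name, arguments),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_call("search_web", {"query": "rust async scraping"})
print(payload.decode())
```

With the stack from the quick start below running, `call_tool("http://localhost:5001", "search_web", {"query": "..."})` would issue the same request over HTTP.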
## Architecture

```mermaid
graph TD
    A[MCP Client / Agent] -->|stdio| B[shadowcrawl-mcp]
    A -->|HTTP| C[shadowcrawl HTTP]
    B --> D[MCP Tool Handlers]
    C --> D
    D --> E[tools module - search, scrape, batch, crawl, extract]
    D --> F[features module - history, proxies, antibot]
    E --> G[SearXNG]
    E --> H[Rust scraper]
    H --> K[Browserless optional]
    F --> I[Qdrant optional]
    F --> J[ip.txt proxy list]
```
Code layout (after refactor):

- `mcp-server/src/core` — `AppState`, shared types
- `mcp-server/src/tools` — search/scrape/crawl/extract/batch implementations
- `mcp-server/src/features` — history (Qdrant), proxies, antibot helpers
- `mcp-server/src/nlp` — query rewriting + rerank
- `mcp-server/src/mcp` — HTTP/stdio transports + per-tool handlers + tool catalog
- `mcp-server/src/scraping` — `RustScraper` and its internals
## Anti-bot & JS-heavy site support

If you're evaluating paid scraping stacks, note: ShadowCrawl includes the same practical building blocks, self-hosted and customizable.

- Browserless Chromium (optional, Docker) — JS-heavy rendering with stealth defaults
  - Included in the default stack and configurable via `BROWSERLESS_URL` and `BROWSERLESS_TOKEN`.
  - Supports session tokens, prebooted Chrome, and ad-blocking for more stable runs.
- Stealth fingerprinting and automation-marker cleanup
  - Rotating user-agents, `sec-ch-ua` profiles, and stealth headers tuned for Browserless Chrome.
  - JS cleanup removes common automation markers (including Playwright/Puppeteer markers such as `window.__playwright`) before final extraction.
- Human-like pacing & adaptive delays
  - Jittered request delays (`SCRAPE_DELAY_PRESET`, `SCRAPE_DELAY_MIN_MS`, `SCRAPE_DELAY_MAX_MS`) and a boss-domain post-load delay to mimic human patterns.
- Proxy-driven anti-bot bypass (high-security sites)
  - `proxy_manager` supports `grab`, `list`, `status`, `switch`, and `test` to maintain healthy proxy pools.
  - Supports multiple schemes (`http`, `https`, `socks5`) and per-request proxy selection/rotation.
  - Health checks and automatic switch logic help avoid blocked IPs.
  - For heavily protected targets, combine premium residential/ISP proxies with Browserless rendering and stealth headers.
- Combinable strategies
  - Mix Browserless rendering, stealth headers, pacing, and proxy rotation to maximize success on protected targets.
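A combined setup might look like the environment sketch below. The variable names come from this README; the values (and the `human` preset name) are illustrative assumptions, not shipped defaults.

```shell
# Illustrative anti-bot combination: Browserless rendering + paced
# requests + proxy rotation. Values are examples, not defaults.
export BROWSERLESS_URL="http://browserless:3000"   # optional rendering backend
export BROWSERLESS_TOKEN="changeme"                # token for your Browserless instance
export SCRAPE_DELAY_PRESET="human"                 # assumed preset name; check your build
export SCRAPE_DELAY_MIN_MS=500                     # lower bound for jittered delay
export SCRAPE_DELAY_MAX_MS=2500                    # upper bound for jittered delay
export IP_LIST_PATH="./ip.txt"                     # proxy pool used for rotation
```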
## Scrape Content Quality (Implemented)

Scope: applied across `scrape_url`, `scrape_batch`, `search_structured`, and `crawl_website` outputs.

- HTML-to-Markdown conversion hardening ✅
  - Normalizes `<summary>` to Markdown-style section headings.
  - Unwraps `<details>` containers while preserving inner text.
  - Cleans residual summary/details fragments in post-conversion Markdown.
- Aggressive attribute stripping ✅
  - Strips non-semantic attributes before extraction (`class`, `id`, `style`, `aria-*`, `data-*`, `on*`, `role`, `tabindex`).
  - Keeps semantic signals where needed in extracted results (`href`, `src`, `alt`, `title`).
  - Reduces UI/noise tokens that degrade LLM comprehension.
- Media handling in content preview ✅
  - Keeps OG image metadata and now injects Markdown image context with fallback labels.
  - Exposes image hints directly in the `scrape_url` text preview for better multimodal grounding.
## Noise-Reduction Strategy (How we stay competitive)
- Layered extraction pipeline: pre-clean HTML → readability/heuristics → fallback text-only extraction.
- Boilerplate suppression: removes nav/footer/forms/ads and common noisy blocks before markdown conversion.
- Semantic-first output: prioritizes main content, headings, code blocks, and canonical links over decorative markup.
- Anti-block resilience: combines stealth headers, adaptive delays, Browserless rendering, and proxy rotation.
- Quality gates: warnings and extraction score help detect weak/blocked content and trigger retries/fallbacks.
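The quality-gate idea can be sketched as a simple retry predicate. This is a hypothetical illustration, not ShadowCrawl's internal logic: the `extraction_score`/`warnings` field names and the 0.4 threshold are assumptions.

```python
# Hypothetical quality gate: retry a scrape when the extraction score is
# weak or warnings suggest a block/CAPTCHA. Field names and the threshold
# are illustrative assumptions, not ShadowCrawl's actual API.
def should_retry(result: dict, min_score: float = 0.4) -> bool:
    score = result.get("extraction_score", 0.0)
    warnings = result.get("warnings", [])
    blocked = any("block" in w.lower() or "captcha" in w.lower() for w in warnings)
    return blocked or score < min_score

weak = {"extraction_score": 0.2, "warnings": ["content looks like a CAPTCHA page"]}
good = {"extraction_score": 0.9, "warnings": []}
print(should_retry(weak), should_retry(good))
```

A caller would route `True` results to a fallback path (e.g. Browserless rendering or a fresh proxy) before giving up.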
## Quick start (Docker)

Start the stack:

```shell
docker compose -f docker-compose-local.yml up -d --build
```

Check health:

```shell
curl -s http://localhost:5001/health
```

Check the tool surface:

```shell
curl -s http://localhost:5001/mcp/tools
```
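To sanity-check the tool surface programmatically, you might extract the tool names from the catalog response. The `{"tools": [{"name": ...}]}` shape is an assumption about the `/mcp/tools` payload; the `sample` below is stand-in data, not real output.

```python
import json

# Extract tool names from a /mcp/tools response. The {"tools": [...]}
# shape is an assumption; inspect the real response from your instance.
sample = json.dumps({
    "tools": [
        {"name": "search_web", "description": "federated web search"},
        {"name": "scrape_url", "description": "single URL extraction"},
    ]
})

names = [tool["name"] for tool in json.loads(sample)["tools"]]
print(names)
```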
## Quick start (local Rust)

Run the HTTP server:

```shell
cd mcp-server
cargo run --release
```

Run the MCP stdio server:

```shell
cd mcp-server
cargo run --release --bin shadowcrawl-mcp
```
## Proxy configuration (ip.txt + proxy_source.json)

This project uses `ip.txt` as the primary proxy list.

- `ip.txt` (one proxy per line). Examples:
  - `http://1.2.3.4:8080`
  - `https://1.2.3.4:8443`
  - `socks5://1.2.3.4:1080`
- `proxy_source.json` (public sources that `proxy_manager` can fetch from)

With `docker-compose-local.yml`, the defaults are already wired:

- `IP_LIST_PATH=/home/appuser/ip.txt`
- `PROXY_SOURCE_PATH=/home/appuser/proxy_source.json`

And mounted from your repo root:

- `./ip.txt:/home/appuser/ip.txt`
- `./proxy_source.json:/home/appuser/proxy_source.json`
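Since `ip.txt` entries follow the `scheme://host:port` shape shown above, they can be validated with the standard library. This is a small client-side sketch, not the parser ShadowCrawl uses internally.

```python
from urllib.parse import urlsplit

# Split ip.txt entries (scheme://host:port, one per line) into parts,
# skipping blanks and comment lines. Validation sketch only.
def parse_proxies(text: str) -> list[tuple[str, str, int]]:
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = urlsplit(line)
        if parts.scheme in ("http", "https", "socks5") and parts.port:
            out.append((parts.scheme, parts.hostname, parts.port))
    return out

print(parse_proxies("http://1.2.3.4:8080\nsocks5://1.2.3.4:1080\n"))
```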
## Using the proxy_manager tool

The `proxy_manager` MCP tool supports these actions:

- `grab` — fetch proxy lists from sources in `proxy_source.json`
- `list` — list proxies currently in `ip.txt`
- `status` — proxy manager stats (requires `IP_LIST_PATH` to exist)
- `switch` — select the best proxy
- `test` — test a proxy against a target URL
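Over the HTTP transport, a `test` action might be invoked with a payload like the one below. The `{"name", "arguments"}` envelope and the `action`/`url` argument names are assumptions; check `GET /mcp/tools` for the actual input schema.

```python
import json

# Hypothetical proxy_manager invocation body for POST /mcp/call.
# Envelope and argument names are assumptions, not a documented schema.
payload = json.dumps({
    "name": "proxy_manager",
    "arguments": {"action": "test", "url": "https://example.com"},
})
print(payload)
```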
## MCP client configuration (stdio)

Use this in your VS Code/Cursor `mcp.json`:

```json
{
  "servers": {
    "shadowcrawl": {
      "command": "docker",
      "args": [
        "compose",
        "-f",
        "/absolute/path/to/shadowcrawl/docker-compose-local.yml",
        "exec",
        "-i",
        "-T",
        "shadowcrawl",
        "shadowcrawl-mcp"
      ],
      "type": "stdio"
    }
  }
}
```
## Sample results
Real tool outputs (copy/paste examples):
- sample-results/search_web.txt
- sample-results/search_structured_json.txt
- sample-results/scrape_url.txt
- sample-results/scrape_url_json.txt
- sample-results/scrape_batch_json.txt
- sample-results/crawl_website_json.txt
- sample-results/extract_structured_json.txt
- sample-results/research_history_json.txt
- sample-results/proxy_manager_json.txt
## What changed from v0.3.x to v1.0++

- Introduced a production MCP surface with unified HTTP + stdio tool behavior
- Added and validated the `proxy_manager` operational workflow
- Expanded tooling to the full 8-tool set with end-to-end readiness checks
- Improved server lifecycle handling for stdio MCP reliability
- Centralized tool schema/catalog definitions to reduce drift
- Added release validation artifacts and readiness reporting

Detailed notes: docs/RELEASE_NOTES_v1.0.0.md
## Key environment variables

| Variable | Purpose | Default |
|---|---|---|
| `SEARXNG_URL` | search backend URL | `http://searxng:8080` |
| `QDRANT_URL` | semantic memory backend | unset |
| `BROWSERLESS_URL` | browser rendering backend | unset |
| `HTTP_TIMEOUT_SECS` | outbound timeout | 30 |
| `HTTP_CONNECT_TIMEOUT_SECS` | connect timeout | 10 |
| `OUTBOUND_LIMIT` | concurrency limiter | 32 |
| `IP_LIST_PATH` | proxy IP list path | `ip.txt` |
| `PROXY_SOURCE_PATH` | proxy source list path | `proxy_source.json` |

Proxy config template: `docs/examples/proxy_source.example.json`

NOTE: `ip.txt` is tracked in the repo root and is the canonical proxy list.
## Comparison (quick decision guide)
This project is meant to be self-hosted infrastructure. A rough mental model:
| If you use… | You may prefer ShadowCrawl when… |
|---|---|
| Firecrawl / other hosted scraping APIs | You want local control (cost, privacy, networking), MCP-native integration, and can run Docker. |
| Jina Reader / “reader mode” services | You need more than reader conversion: crawling, batch mode, structured extraction, and a single MCP tool surface. |
| Browserless (alone) | You want Browserless as an optional backend, but with a full tool suite (search/crawl/extract/proxy/history) around it. |
| Bright Data / proxy networks | You already have proxy sources, and want a Rust/MCP orchestration layer + rotation/health logic on top (this repo does not ship a proxy network). |
## Production checklist

- Build passes (`cargo check`)
- Release build passes (`cargo build --release`)
- MCP tool surface is discoverable (`/mcp/tools`)
- Core tools return expected payloads
- Proxy manager commands operational
- Health endpoint stable
- Release validation artifact generated

Operational checklist document: docs/GA_REFACTOR_READINESS_2026-02-12.md
## Repository map
- mcp-server/src — Rust core server and MCP handlers
- docs — release reports, architecture notes, operational guidance
- docs/IDE_SETUP.md — MCP client setup for popular IDEs/apps
- docs/SEARXNG_TUNING.md — tuning SearXNG for noise / bans
- searxng — SearXNG runtime configuration
- sample-results — sample outputs
## 🙏 Acknowledgments & Support
Built with ❤️ by a Solo Developer for the open-source community.
I'm actively maintaining this project to provide the best free search & scraping infrastructure for AI agents.
- Found a bug? I'm happy to fix it! Please Open an Issue.
- Want a new feature? Feature requests are welcome! Let me know what you need.
- Love the project? Star the repo ⭐ or buy me a coffee to support development!
Special thanks to:
- SearXNG Project for the incredible privacy-respecting search infrastructure.
- Qdrant for the vector search engine.
- Rust Community for the amazing tooling.
## License
MIT License. Free to use for personal and commercial projects.