
ShadowCrawl MCP

The Sovereign Stealth Intelligence Engine for AI Agents

Self-hosted stealth scraping and federated search for AI agents. A 100% private, free alternative to Firecrawl, Jina Reader, and Tavily, featuring universal anti-bot bypass, semantic research memory, and copy-paste setup.


ShadowCrawl is built for AI agent workflows that need:

  • Reliable multi-source web search
  • Fast content extraction and website crawling
  • Structured data extraction from messy pages
  • Memory-aware research history with semantic recall
  • Robust anti-bot & JS-heavy site support (Browserless stealth, automation-marker cleanup, proxy rotation)
  • MCP-native usage over stdio and HTTP

If you want something you can run inside your own infra (Docker) and wire directly into Cursor/Claude via MCP, this repo is the “batteries included” baseline.

Current release status

  • Runtime version: v1.1.0
  • Release validation: stdio agent-mode + HTTP tool-path checks completed for release hardening
  • Service health endpoint: GET /health
  • Tool catalog endpoint: GET /mcp/tools
  • Tool call endpoint: POST /mcp/call

v1.1.0 highlights

  • Shared quality-policy utilities are now centralized and reused across search/scrape/crawl/batch handlers.
  • quality_mode is available at runtime (balanced / aggressive) across major extraction tools.
  • proxy_manager adds strict_proxy_health (non-strict diagnostics vs strict hard-fail behavior).
  • scrape_url/scrape_batch JSON defaults now omit raw HTML noise unless explicitly requested.
  • MCP stdio server path is validated for real agent-mode initialize/list/call flows.

Tool catalog (v1.0++)

The platform currently exposes 8 MCP tools:

  1. search_web — federated web search
  2. search_structured — search + top result scraping
  3. scrape_url — single URL extraction
  4. scrape_batch — multi-URL parallel scraping
  5. crawl_website — bounded recursive crawling
  6. extract_structured — schema-driven extraction
  7. research_history — semantic recall from prior runs
  8. proxy_manager — proxy list/status/switch/test/grab operations

Tip: all tools are available via both transports:

  • HTTP: GET /mcp/tools, POST /mcp/call
  • MCP stdio: shadowcrawl-mcp
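
As a concrete sketch, a tool invocation over HTTP is a JSON POST to /mcp/call. The payload shape below (a name field plus an arguments object) is an assumption based on common MCP HTTP bridges, not confirmed by this repo; check GET /mcp/tools for each tool's actual input schema.

```shell
# Hypothetical request body for POST /mcp/call; the "name"/"arguments"
# field names are assumed -- inspect GET /mcp/tools for the real schema.
payload='{"name":"search_web","arguments":{"query":"rust async runtimes"}}'

# Validate the JSON locally before sending it anywhere.
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload ok"

# Against a running stack (see Quick start):
# curl -s -X POST http://localhost:5001/mcp/call \
#   -H 'Content-Type: application/json' -d "$payload"
```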

Architecture

graph TD
   A[MCP Client / Agent] -->|stdio| B[shadowcrawl-mcp]
   A -->|HTTP| C[shadowcrawl HTTP]
   B --> D[MCP Tool Handlers]
   C --> D
   D --> E[tools module - search, scrape, batch, crawl, extract]
   D --> F[features module - history, proxies, antibot]
   E --> G[SearXNG]
   E --> H[Rust scraper]
   H --> K[Browserless optional]
   F --> I[Qdrant optional]
   F --> J[ip.txt proxy list]

Code layout (after refactor):

  • mcp-server/src/core — AppState, shared types
  • mcp-server/src/tools — search/scrape/crawl/extract/batch implementations
  • mcp-server/src/features — history (Qdrant), proxies, antibot helpers
  • mcp-server/src/nlp — query rewriting + rerank
  • mcp-server/src/mcp — HTTP/stdio transports + per-tool handlers + tool catalog
  • mcp-server/src/scraping — RustScraper and its internals

Anti-bot & JS-heavy site support

If you're evaluating paid scraping stacks, note: ShadowCrawl includes the same practical building blocks—self-hosted and customizable.

  • Browserless Chromium (optional, Docker) — JS-heavy rendering with stealth defaults
    • Included in the default stack and configurable via BROWSERLESS_URL and BROWSERLESS_TOKEN.
    • Supports session tokens, prebooted Chrome, and ad-blocking for more stable runs.
  • Stealth fingerprinting and automation-marker cleanup
    • Rotating user-agents, sec-ch-ua profiles, and stealth headers tuned for Browserless Chrome.
    • JS cleanup removes common automation markers (including Playwright/Puppeteer markers such as window.__playwright) before final extraction.
  • Human-like pacing & adaptive delays
    • Jittered request delays (SCRAPE_DELAY_PRESET, SCRAPE_DELAY_MIN_MS, SCRAPE_DELAY_MAX_MS) and boss-domain post-load delay to mimic human patterns.
  • Proxy-driven anti-bot bypass (high-security sites)
    • proxy_manager supports grab, list, status, switch, and test to maintain healthy proxy pools.
    • Supports multiple schemes (http, https, socks5) and per-request proxy selection/rotation.
    • Health checks and automatic switch logic help avoid blocked IPs.
    • For heavily protected targets, combine premium residential/ISP proxies with Browserless rendering and stealth headers.
  • Combinable strategies
    • Mix Browserless rendering, stealth headers, pacing, and proxy rotation to maximize success on protected targets.
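
The pacing idea is simple: pick a random delay inside a configured window. A minimal sketch of the jitter driven by SCRAPE_DELAY_MIN_MS/SCRAPE_DELAY_MAX_MS (the real implementation lives in the Rust server; this bash version is illustrative only):

```shell
# Jittered delay between SCRAPE_DELAY_MIN_MS and SCRAPE_DELAY_MAX_MS.
MIN_MS="${SCRAPE_DELAY_MIN_MS:-500}"
MAX_MS="${SCRAPE_DELAY_MAX_MS:-2000}"
RANGE=$(( MAX_MS - MIN_MS + 1 ))
DELAY_MS=$(( MIN_MS + RANDOM % RANGE ))   # uniform jitter in [MIN, MAX]
echo "next request delayed by ${DELAY_MS}ms"
```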

Scrape Content Quality (Implemented)

These improvements apply across scrape_url, scrape_batch, search_structured, and crawl_website outputs.

  1. HTML-to-Markdown conversion hardening

    • Normalizes <summary> to Markdown-style section headings.
    • Unwraps <details> containers while preserving inner text.
    • Cleans residual summary/details fragments in post-conversion markdown.
  2. Aggressive attribute stripping

    • Strips non-semantic attributes before extraction (class, id, style, aria-*, data-*, on*, role, tabindex).
    • Keeps semantic signals where needed in extracted results (href, src, alt, title).
    • Reduces UI/noise tokens that degrade LLM comprehension.
  3. Media handling in content preview

    • Keeps OG image metadata and now injects Markdown image context (![alt](url)) with fallback labels.
    • Exposes image hints directly in scrape_url text preview for better multimodal grounding.
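
To illustrate the attribute-stripping step, here is a rough sed approximation. The real pipeline does this in Rust on a parsed DOM; regexes over HTML are only good enough for a demo:

```shell
# Strip non-semantic attributes; keep semantic ones (href/src/alt/title).
html='<div class="nav" id="menu" data-track="1"><a href="/docs" class="btn" title="Docs">Docs</a></div>'
echo "$html" | sed -E 's/ (class|id|style|role|tabindex|aria-[a-z-]+|data-[a-z-]+|on[a-z]+)="[^"]*"//g'
# → <div><a href="/docs" title="Docs">Docs</a></div>
```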

Noise-Reduction Strategy (How we stay competitive)

  • Layered extraction pipeline: pre-clean HTML → readability/heuristics → fallback text-only extraction.
  • Boilerplate suppression: removes nav/footer/forms/ads and common noisy blocks before markdown conversion.
  • Semantic-first output: prioritizes main content, headings, code blocks, and canonical links over decorative markup.
  • Anti-block resilience: combines stealth headers, adaptive delays, Browserless rendering, and proxy rotation.
  • Quality gates: warnings and extraction score help detect weak/blocked content and trigger retries/fallbacks.

Quick start (Docker)

  1. Start the stack
docker compose -f docker-compose-local.yml up -d --build
  2. Check health
curl -s http://localhost:5001/health
  3. Check the tool surface
curl -s http://localhost:5001/mcp/tools

Quick start (local Rust)

Run HTTP server:

cd mcp-server
cargo run --release

Run MCP stdio:

cd mcp-server
cargo run --release --bin shadowcrawl-mcp
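
To smoke-test the stdio transport by hand, you can feed it a JSON-RPC initialize message. The message below follows the standard MCP handshake shape; the protocolVersion and clientInfo values are illustrative and should match your client.

```shell
# JSON-RPC "initialize" request an MCP client sends first over stdio.
init='{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"probe","version":"0.0.0"}}}'
echo "$init" | python3 -m json.tool > /dev/null && echo "init ok"

# Pipe it into the binary to check the handshake responds:
# echo "$init" | ./target/release/shadowcrawl-mcp
```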

Proxy configuration (ip.txt + proxy_source.json)

This project uses ip.txt as the primary proxy list.

  • ip.txt (one proxy per line)
    • Examples:
      • http://1.2.3.4:8080
      • https://1.2.3.4:8443
      • socks5://1.2.3.4:1080
  • proxy_source.json (public sources that proxy_manager can fetch from)

With docker-compose-local.yml, the defaults are already wired:

  • IP_LIST_PATH=/home/appuser/ip.txt
  • PROXY_SOURCE_PATH=/home/appuser/proxy_source.json

And mounted from your repo root:

  • ./ip.txt:/home/appuser/ip.txt
  • ./proxy_source.json:/home/appuser/proxy_source.json
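
A quick way to sanity-check ip.txt entries before mounting them is a scheme://host:port grep. This is only a sketch; the server's own parser is authoritative:

```shell
# Keep only lines that look like scheme://host:port.
printf '%s\n' 'http://1.2.3.4:8080' 'socks5://1.2.3.4:1080' 'not-a-proxy' > /tmp/ip.txt
grep -E '^(http|https|socks5)://[0-9A-Za-z.-]+:[0-9]+$' /tmp/ip.txt
# → http://1.2.3.4:8080
# → socks5://1.2.3.4:1080
```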

Using the proxy_manager tool

The proxy_manager MCP tool supports actions:

  • grab — fetch proxy lists from sources in proxy_source.json
  • list — list proxies currently in ip.txt
  • status — proxy manager stats (requires IP_LIST_PATH to exist)
  • switch — select best proxy
  • test — test a proxy against a target URL
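
An invocation of proxy_manager over HTTP would look like the payload below. The action value comes from the list above, but the exact argument field names (proxy, target_url) are assumptions; consult GET /mcp/tools for the real schema.

```shell
# Hypothetical payload testing one proxy against a target URL.
pm_payload='{"name":"proxy_manager","arguments":{"action":"test","proxy":"http://1.2.3.4:8080","target_url":"https://example.com"}}'
echo "$pm_payload" | python3 -m json.tool > /dev/null && echo "ok"

# curl -s -X POST http://localhost:5001/mcp/call \
#   -H 'Content-Type: application/json' -d "$pm_payload"
```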

MCP client configuration (stdio)

Use this in your VS Code/Cursor mcp.json:

{
   "servers": {
      "shadowcrawl": {
         "command": "docker",
         "args": [
            "compose",
            "-f",
            "/absolute/path/to/shadowcrawl/docker-compose-local.yml",
            "exec",
            "-i",
            "-T",
            "shadowcrawl",
            "shadowcrawl-mcp"
         ],
         "type": "stdio"
      }
   }
}

What changed from v0.3.x to v1.0++

  • Introduced production MCP surface with unified HTTP + stdio tool behavior
  • Added and validated proxy_manager operational workflow
  • Expanded tooling to full 8-tool set with end-to-end readiness checks
  • Improved server lifecycle handling for stdio MCP reliability
  • Centralized tool schema/catalog definitions to reduce drift
  • Added release validation artifacts and readiness reporting

Detailed notes: docs/RELEASE_NOTES_v1.0.0.md

Key environment variables

| Variable | Purpose | Default |
| --- | --- | --- |
| SEARXNG_URL | search backend URL | http://searxng:8080 |
| QDRANT_URL | semantic memory backend | unset |
| BROWSERLESS_URL | browser rendering backend | unset |
| HTTP_TIMEOUT_SECS | outbound timeout | 30 |
| HTTP_CONNECT_TIMEOUT_SECS | connect timeout | 10 |
| OUTBOUND_LIMIT | concurrency limiter | 32 |
| IP_LIST_PATH | proxy IP list path | ip.txt |
| PROXY_SOURCE_PATH | proxy source list path | proxy_source.json |
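
For a local (non-Docker) run, these variables can be exported before cargo run. The values below are illustrative defaults for local backends, not recommendations:

```shell
# Example environment for a local run against local backends.
export SEARXNG_URL=http://localhost:8080     # search backend
export QDRANT_URL=http://localhost:6333      # optional semantic memory
export HTTP_TIMEOUT_SECS=30
export HTTP_CONNECT_TIMEOUT_SECS=10
export OUTBOUND_LIMIT=32
export IP_LIST_PATH=./ip.txt
export PROXY_SOURCE_PATH=./proxy_source.json
echo "env ready: SEARXNG_URL=$SEARXNG_URL"
```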

Proxy config templates:

  • docs/examples/proxy_source.example.json

NOTE: ip.txt is tracked in the repo root and is the canonical proxy list.

Comparison (quick decision guide)

This project is meant to be self-hosted infrastructure. A rough mental model:

| If you use… | You may prefer ShadowCrawl when… |
| --- | --- |
| Firecrawl / other hosted scraping APIs | You want local control (cost, privacy, networking), MCP-native integration, and can run Docker. |
| Jina Reader / “reader mode” services | You need more than reader conversion: crawling, batch mode, structured extraction, and a single MCP tool surface. |
| Browserless (alone) | You want Browserless as an optional backend, but with a full tool suite (search/crawl/extract/proxy/history) around it. |
| Bright Data / proxy networks | You already have proxy sources and want a Rust/MCP orchestration layer plus rotation/health logic on top (this repo does not ship a proxy network). |

Production checklist

  • Build passes (cargo check)
  • Release build passes (cargo build --release)
  • MCP tool surface is discoverable (/mcp/tools)
  • Core tools return expected payloads
  • Proxy manager commands operational
  • Health endpoint stable
  • Release validation artifact generated

Operational checklist document: docs/GA_REFACTOR_READINESS_2026-02-12.md

🙏 Acknowledgments & Support

Built with ❤️ by a Solo Developer for the open-source community.

I'm actively maintaining this project to provide the best free search & scraping infrastructure for AI agents.

  • Found a bug? I'm happy to fix it! Please Open an Issue.
  • Want a new feature? Feature requests are welcome! Let me know what you need.
  • Love the project? Star the repo ⭐ or buy me a coffee to support development!


Special thanks to:

  • SearXNG Project for the incredible privacy-respecting search infrastructure.
  • Qdrant for the vector search engine.
  • Rust Community for the amazing tooling.

License

MIT License. Free to use for personal and commercial projects.
