# ShadowCrawl MCP v3.1.0
Search Smarter. Scrape Anything. Block Nothing.
The God-Tier Intelligence Engine for AI Agents
The Sovereign, Self-Hosted Alternative to Firecrawl, Jina, and Tavily.
ShadowCrawl is not just a scraper or a search wrapper - it is a complete intelligence layer purpose-built for AI Agents. ShadowCrawl ships a native Rust meta-search engine running inside the same binary. Zero extra containers. Parallel engines. LLM-grade clean output.
When every other tool gets blocked, ShadowCrawl doesn't retreat - it escalates: native engines → native Chromium CDP headless → Human-In-The-Loop (HITL) nuclear option. You always get results.
## God-Tier Internal Meta-Search (v3.0.0)
ShadowCrawl v3.0.0 ships a 100% Rust-native metasearch engine that queries 4 engines in parallel and fuses results intelligently:
| Engine | Coverage | Notes |
|---|---|---|
| DuckDuckGo | General Web | HTML scrape, no API key needed |
| Bing | General + News | Best for current events |
| Google | Authoritative Results | High-relevance, deduped |
| Brave Search | Privacy-Focused | Independent index, low overlap |
### What makes it God-Tier?
**Parallel Concurrency** - All 4 engines fire simultaneously. Total latency = slowest engine, not the sum of all.
**Smart Deduplication + Scoring** - Cross-engine results are merged by URL fingerprint. Pages confirmed by 2+ engines receive a corroboration score boost. Domain authority weighting (docs, .gov, .edu, major outlets) pushes high-trust sources to the top.
**Ultra-Clean Output for LLMs** - Clean fields and predictable structure:
- `published_at` is parsed and stored as a clean ISO-8601 field (`2025-07-23T00:00:00`)
- `content`/`snippet` is clean - zero date-prefix garbage
- `breadcrumbs` extracted from the URL path for navigation context
- `domain` and `source_type` auto-classified (`blog`, `docs`, `reddit`, `news`, etc.)
Result: LLMs receive dense, token-efficient, structured data - not a wall of noisy text.
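As a concrete illustration, one cleaned result can be modeled as a small Rust struct. This is a sketch: the field names follow the description above, not the exact ShadowCrawl schema.

```rust
// Illustrative shape of one cleaned search result; this is a sketch,
// not the exact ShadowCrawl output schema.
#[derive(Debug)]
struct CleanResult {
    url: String,
    domain: String,
    source_type: String,          // auto-classified: "blog", "docs", "reddit", "news", ...
    published_at: Option<String>, // ISO-8601, e.g. "2025-07-23T00:00:00"
    breadcrumbs: Vec<String>,     // derived from the URL path
    snippet: String,              // clean text, no date-prefix garbage
}

fn main() {
    let r = CleanResult {
        url: "https://docs.rs/tokio/latest/tokio/".into(),
        domain: "docs.rs".into(),
        source_type: "docs".into(),
        published_at: None,
        breadcrumbs: vec!["tokio".into(), "latest".into()],
        snippet: "Tokio is an asynchronous runtime for Rust.".into(),
    };
    println!("{r:?}");
}
```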
**Unstoppable Fallback** - If an engine returns a bot-challenge page (anomaly.js, Cloudflare, PerimeterX), it is automatically retried via the native Chromium CDP instance (headless Chrome, bundled in-binary). No manual intervention. No 0-result failures.
**Quality > Quantity** - ~20 deduplicated, scored results rather than 50 raw duplicates. For an AI agent with a limited context window, 20 high-quality results outperform 50 noisy ones every time.
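The merge-and-boost step above can be sketched in Rust. Everything here is illustrative: the `SearchHit` type, the `fingerprint` normalization, and the 1.5× corroboration factor are assumptions, not ShadowCrawl internals.

```rust
use std::collections::HashMap;

// Hypothetical result type; field names are illustrative, not ShadowCrawl's API.
#[derive(Clone)]
struct SearchHit {
    url: String,
    score: f32,
}

/// Normalize a URL into a fingerprint so the same page found by
/// different engines collapses into one entry.
fn fingerprint(url: &str) -> String {
    url.trim_end_matches('/')
        .trim_start_matches("https://")
        .trim_start_matches("http://")
        .trim_start_matches("www.")
        .to_lowercase()
}

/// Merge hits from all engines: pages confirmed by 2+ engines get a
/// corroboration boost, then results are sorted by final score.
fn merge(engine_results: Vec<Vec<SearchHit>>) -> Vec<SearchHit> {
    let mut merged: HashMap<String, (SearchHit, u32)> = HashMap::new();
    for hits in engine_results {
        for hit in hits {
            let entry = merged
                .entry(fingerprint(&hit.url))
                .or_insert((hit.clone(), 0));
            entry.1 += 1; // count how many engines confirmed this page
            entry.0.score = entry.0.score.max(hit.score);
        }
    }
    let mut out: Vec<SearchHit> = merged
        .into_values()
        .map(|(mut hit, engines)| {
            if engines >= 2 {
                hit.score *= 1.5; // corroboration boost (illustrative factor)
            }
            hit
        })
        .collect();
    out.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
    out
}

fn main() {
    let ddg = vec![SearchHit { url: "https://docs.rs/tokio/".into(), score: 0.8 }];
    let bing = vec![SearchHit { url: "http://www.docs.rs/tokio".into(), score: 0.7 }];
    let merged = merge(vec![ddg, bing]);
    println!("{} result(s), top score {:.2}", merged.len(), merged[0].score);
}
```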
## Deep Research Engine (v3.1.0)
ShadowCrawl v3.1.0 ships a self-contained multi-hop research pipeline as a first-class MCP tool - no external infra, no key required for local LLMs.
### How it works
- **Query Expansion** - expands your question into multiple targeted sub-queries (3 axes: core concept, comparison/alternatives, implementation specifics)
- **Parallel Search + Scrape** - fires all sub-queries across 4 search engines; auto-scrapes top results (configurable depth 1-3, up to 20 sources)
- **Semantic Filtering** - Model2Vec-powered relevance scoring keeps only on-topic content chunks
- **LLM Synthesis** - condenses all findings into a zero-fluff Markdown fact-sheet via any OpenAI-compatible API
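The query-expansion step can be sketched as follows; the axis templates below are illustrative assumptions, not ShadowCrawl's actual expansion prompts.

```rust
/// Minimal sketch of the three expansion axes described above; the
/// templates here are illustrative, not ShadowCrawl's actual prompts.
fn expand_query(question: &str) -> Vec<String> {
    vec![
        question.to_string(),                             // axis 1: core concept
        format!("{question} vs alternatives comparison"), // axis 2: comparison/alternatives
        format!("{question} implementation example"),     // axis 3: implementation specifics
    ]
}

fn main() {
    for q in expand_query("rust async runtimes") {
        println!("sub-query: {q}");
    }
}
```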
### LLM Backend Options
| Backend | llm_base_url | Key required |
|---|---|---|
| OpenAI (default) | https://api.openai.com/v1 | Yes (`OPENAI_API_KEY`) |
| Ollama (local) | http://localhost:11434/v1 | No |
| LM Studio (local) | http://localhost:1234/v1 | No |
| Any OpenAI-compatible proxy | custom URL | Optional |
### Configuration (shadowcrawl.json)
Create `shadowcrawl.json` in the same directory as the binary (or repo root) to configure the engine - no rebuild needed. All fields are optional; env vars are used as fallback.
```json
{
  "deep_research": {
    "enabled": true,
    "llm_base_url": "http://localhost:11434/v1",
    "llm_api_key": "",
    "llm_model": "llama3",
    "synthesis_enabled": true,
    "synthesis_max_sources": 8,
    "synthesis_max_chars_per_source": 2500,
    "synthesis_max_tokens": 1024
  }
}
```
Priority: `shadowcrawl.json` field → env var fallback → hardcoded default.
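That resolution order amounts to a simple option chain. A minimal sketch (function and argument names are illustrative, not ShadowCrawl internals):

```rust
// Sketch of the setting-resolution order described above:
// json field -> env var -> hardcoded default.
fn resolve(json_field: Option<&str>, env_var: Option<&str>, default: &str) -> String {
    json_field.or(env_var).unwrap_or(default).to_string()
}

fn main() {
    // No json field, env var set -> env var wins over the default.
    assert_eq!(resolve(None, Some("llama3"), "gpt-4o-mini"), "llama3");
    // Json field set -> it beats both env var and default.
    assert_eq!(resolve(Some("mistral"), Some("llama3"), "gpt-4o-mini"), "mistral");
    // Nothing configured -> hardcoded default.
    assert_eq!(resolve(None, None, "gpt-4o-mini"), "gpt-4o-mini");
    println!("resolution order verified");
}
```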
### Build flags
```bash
# Full build (deep_research included by default)
cargo build --release

# Lean build - strip the deep_research feature entirely
cargo build --release --no-default-features --features non_robot_search
```
The `deep-research` Cargo feature is on by default. Use `--no-default-features` for minimal deployments.
## Full Feature Roster
| Feature | Details |
|---|---|
| Deep Research Engine | Multi-hop search + scrape + semantic filter + LLM synthesis (OpenAI / Ollama / LM Studio) |
| God-Tier Meta-Search | Parallel Google / Bing / DDG / Brave · dedup · scoring · breadcrumbs · published_at |
| Universal Scraper | Rust-native + native Chromium CDP for JS-heavy and anti-bot sites |
| Human Auth (HITL) | `human_auth_session`: real browser + persistent cookies + instruction overlay + automatic re-injection. Fetch any protected URL. |
| Semantic Memory | Embedded LanceDB + Model2Vec for long-term research recall (no DB container) |
| HITL Non-Robot Search | Visible Brave Browser + keyboard hooks for human CAPTCHA / login-wall bypass |
| Deep Crawler | Recursive, bounded crawl to map entire subdomains |
| Proxy Master | Native HTTP/SOCKS5 pool rotation with health checks |
| Universal Janitor | Strips cookie banners, popups, skeleton screens - delivers clean Markdown |
| Hydration Extractor | Resolves React/Next.js hydration JSON (`__NEXT_DATA__`, embedded state) |
| Anti-Bot Arsenal | Stealth UA rotation, fingerprint spoofing, CDP automation, mobile profile emulation |
| Structured Extract | CSS-selector + prompt-driven field extraction from any page |
| Batch Scrape | Parallel scrape of N URLs with configurable concurrency |
## Zero-Bloat Architecture
ShadowCrawl is pure binary: a single Rust executable exposes MCP tools (stdio) and an optional HTTP server - no Docker, no sidecars.
## The Nuclear Option: Human Auth Session (v3.0.0)
When standard automation fails (Cloudflare, CAPTCHA, complex logins), ShadowCrawl activates the human element.
### `human_auth_session` - The "Unblocker"
This is our signature tool that surpasses all competitors. While most scrapers fail on login-walled content, human_auth_session opens a real, visible browser window for you to solve the challenge.
Once you click FINISH & RETURN, all authentication cookies are transparently captured and persisted in ~/.shadowcrawl/sessions/. Subsequent requests to the same domain automatically inject these cookies β making future fetches fully automated and effortless.
- **Instruction Overlay** - A native green banner guides the user on what to solve.
- **Persistent Sessions** - Solve once, scrape forever. No need to log in manually again for weeks.
- **Security first** - Cookies are stored locally; at-rest encryption is optional/upcoming.
- **Auto-injection** - The next `web_fetch` or `web_crawl` call automatically loads matching stored sessions.
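A sketch of how such domain-keyed session lookup could work. The `<host>.json` layout follows the description above, but the exact on-disk format and function names are assumptions, not the real implementation.

```rust
use std::path::PathBuf;

// Sketch: map a request URL to the session file that auto-injection
// would look up under the sessions directory (illustrative layout).
fn session_path_for(url: &str, sessions_dir: &str) -> Option<PathBuf> {
    let host = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))? // only http(s) URLs have sessions
        .split('/')
        .next()?; // everything before the first path separator is the host
    Some(PathBuf::from(sessions_dir).join(format!("{host}.json")))
}

fn main() {
    let p = session_path_for("https://example.com/jobs", "/home/me/.shadowcrawl/sessions");
    println!("{}", p.unwrap().display());
}
```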
## Boss-Level Anti-Bot Evidence
We don't claim - we show receipts. All captured with `human_auth_session` and our advanced CDP engines (2026-02-20):
| Target | Protection | Evidence | Extracted |
|---|---|---|---|
| | Cloudflare + Auth | JSON · Snippet | 60+ job listings ✓ |
| Ticketmaster | Cloudflare Turnstile | JSON · Snippet | Tour dates & venues ✓ |
| Airbnb | DataDome | JSON · Snippet | 1,000+ Tokyo listings ✓ |
| Upwork | reCAPTCHA | JSON · Snippet | 160K+ job postings ✓ |
| Amazon | AWS Shield | JSON · Snippet | RTX 5070 Ti search results ✓ |
| nowsecure.nl | Cloudflare | JSON | Manual button verified ✓ |
Full analysis: proof/README.md
## Quick Start
### Option A - Download Prebuilt Binaries (Recommended)
Download the latest release assets from GitHub Releases and run one of:
Prebuilt assets are published for: windows-x64, windows-arm64, linux-x64, linux-arm64.
- `shadowcrawl-mcp` - MCP stdio server (recommended for VS Code / Cursor / Claude Desktop)
- `shadowcrawl` - HTTP server (default port `5000`; override via `--port`, `PORT`, or `SHADOWCRAWL_PORT`)
Confirm the HTTP server is alive:
```bash
./shadowcrawl --port 5000
curl http://localhost:5000/health
```
### Build (Release, All Features)
Build all binaries with all optional features enabled:
```bash
cd mcp-server
cargo build --release --all-features
```
### Option B - Build / Install from Source
```bash
git clone https://github.com/DevsHero/shadowcrawl.git
cd shadowcrawl
```
Build:
```bash
cd mcp-server
cargo build --release --features non_robot_search --bin shadowcrawl --bin shadowcrawl-mcp
```
Or install (puts binaries into your Cargo bin directory):
```bash
cargo install --path mcp-server --locked
```
Binaries land at:
- `target/release/shadowcrawl` - HTTP server (default port `5000`; override via `--port`, `PORT`, or `SHADOWCRAWL_PORT`)
- `target/release/shadowcrawl-mcp` - MCP stdio server
Prerequisites for HITL:
- Brave Browser (brave.com/download)
- Accessibility permission (macOS: System Preferences → Privacy & Security → Accessibility)
- A desktop session (not SSH-only)
Platform guides: docs/window_setup.md · docs/ubuntu_setup.md
After any binary rebuild/update, restart your MCP client session to pick up new tool definitions.
## Agent Best Practices (ShadowCrawl Rules)
Use this exact decision flow to get the highest-quality results with minimal tokens:
1. `memory_search` first (avoid re-fetching)
2. `web_search_json` for initial research (search + content summaries in one call)
3. `web_fetch` for specific URLs (docs/articles): `output_format="clean_json"` for token-efficient output; set `query` + `strict_relevance=true` when you want only query-relevant paragraphs
4. If `web_fetch` returns 403/429/rate-limit → `proxy_control` `grab`, then retry with `use_proxy=true`
5. If `web_fetch` returns `auth_risk_score >= 0.4` → `visual_scout` (confirm login wall) → `human_auth_session` (The God-Tier Nuclear Option)
Structured extraction (schema-first):
- Prefer `fetch_then_extract` for one-shot fetch + extract.
- `strict=true` (default) enforces schema shape: missing arrays become `[]`, missing scalars become `null` (no schema drift).
- Treat `confidence=0.0` as "placeholder / unrendered page" (often JS-only, like crates.io). Escalate to browser rendering (CDP/HITL) instead of trusting the fields.
- New in v3.0.0: placeholder detection is now scalar-only. Pure-array schemas (only lists/structs) never trigger `confidence=0.0`, fixing prior regressions.
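The scalar-only rule can be sketched as follows, over a simplified field model. The types, names, and the 0.9 threshold used in the demo are illustrative assumptions, not the actual ShadowCrawl implementation.

```rust
// Sketch of the scalar-only placeholder rule described above.
enum FieldValue {
    Scalar(Option<String>),
    Array(Vec<String>),
}

/// Returns true only when most *scalar* fields came back empty;
/// arrays never count, so pure-array schemas can never be placeholders.
fn is_placeholder(fields: &[FieldValue], empty_ratio_threshold: f32) -> bool {
    let scalars: Vec<&Option<String>> = fields
        .iter()
        .filter_map(|f| match f {
            FieldValue::Scalar(v) => Some(v),
            FieldValue::Array(_) => None, // arrays are ignored entirely
        })
        .collect();
    if scalars.is_empty() {
        return false; // pure-array schema: never a placeholder
    }
    let empty = scalars.iter().filter(|v| v.is_none()).count();
    empty as f32 / scalars.len() as f32 >= empty_ratio_threshold
}

fn main() {
    // A rich, list-heavy page: arrays populated, no scalars at all.
    let pure_array = [FieldValue::Array(vec!["mod_a".into(), "mod_b".into()])];
    assert!(!is_placeholder(&pure_array, 0.9));

    // A JS-only placeholder: every scalar came back empty.
    let unrendered = [FieldValue::Scalar(None), FieldValue::Scalar(None)];
    assert!(is_placeholder(&unrendered, 0.9));
    println!("scalar-only placeholder checks pass");
}
```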
`clean_json` notes:
- Large pages are truncated to respect `max_chars` (look for the `clean_json_truncated` warning). Increase `max_chars` to see more.
- `key_code_blocks` is extracted from fenced blocks and signature-like inline code; short docs pages are supported.
- v3.0.0 fix: module extraction on docs.rs works recursively for all relative and absolute sub-paths.
## MCP Integration
ShadowCrawl exposes all tools via the Model Context Protocol (stdio transport).
### VS Code / Copilot Chat
Add to your MCP config (`~/.config/Code/User/mcp.json`):
```json
{
  "servers": {
    "shadowcrawl": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=info",
        "SEARCH_ENGINES=google,bing,duckduckgo,brave",
        "LANCEDB_URI=/YOUR_PATH/shadowcrawl/lancedb",
        "HTTP_TIMEOUT_SECS=30",
        "MAX_CONTENT_CHARS=10000",
        "IP_LIST_PATH=/YOUR_PATH/shadowcrawl/ip.txt",
        "PROXY_SOURCE_PATH=/YOUR_PATH/shadowcrawl/proxy_source.json",
        "/YOUR_PATH/shadowcrawl/mcp-server/target/release/shadowcrawl-mcp"
      ]
    }
  }
}
```
### Cursor / Claude Desktop
Use the same stdio setup as VS Code (run `shadowcrawl-mcp` locally and pass env vars via `env` or your client's env field).
Full multi-IDE guide: docs/IDE_SETUP.md
## Key Environment Variables
| Variable | Default | Description |
|---|---|---|
| `CHROME_EXECUTABLE` | auto-detected | Override path to Chromium/Chrome/Brave binary |
| `SEARCH_ENGINES` | `google,bing,duckduckgo,brave` | Active search engines (comma-separated) |
| `SEARCH_MAX_RESULTS_PER_ENGINE` | `10` | Results per engine before merge |
| `SEARCH_CDP_FALLBACK` | `true` if browser found | Auto-retry blocked engines via native Chromium CDP (alias: `SEARCH_BROWSERLESS_FALLBACK`) |
| `SEARCH_SIMULATE_BLOCK` | unset | Force the blocked path for testing: `duckduckgo,bing` or `all` |
| `LANCEDB_URI` | unset | Path for semantic research memory (optional) |
| `SHADOWCRAWL_NEUROSIPHON` | `1` (enabled) | Set to `0` / `false` / `off` to disable all NeuroSiphon techniques (import nuking, SPA extraction, semantic shaving, search reranking) |
| `HTTP_TIMEOUT_SECS` | `30` | Per-request timeout |
| `OUTBOUND_LIMIT` | `32` | Max concurrent outbound connections |
| `MAX_CONTENT_CHARS` | `10000` | Max chars per scraped document |
| `IP_LIST_PATH` | unset | Path to proxy IP list |
| `SCRAPE_DELAY_PRESET` | `polite` | `fast` / `polite` / `cautious` |
| `DEEP_RESEARCH_ENABLED` | `1` (enabled) | Set `0` to disable the deep_research tool at runtime (no rebuild) |
| `OPENAI_API_KEY` | unset | API key for LLM synthesis. Leave unset for key-less local endpoints (Ollama / LM Studio) |
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | LLM endpoint. Override for Ollama (`http://localhost:11434/v1`) or LM Studio (`http://localhost:1234/v1`). Config: `deep_research.llm_base_url` |
| `DEEP_RESEARCH_LLM_MODEL` | `gpt-4o-mini` | Model name (e.g. `llama3`, `mistral`). Config: `deep_research.llm_model` |
| `DEEP_RESEARCH_SYNTHESIS` | `1` (enabled) | Set `0` to run search + scrape only (skip the LLM step). Config: `deep_research.synthesis_enabled` |
| `DEEP_RESEARCH_SYNTHESIS_MAX_SOURCES` | `8` | Max source docs fed to the LLM. Config: `deep_research.synthesis_max_sources` |
| `DEEP_RESEARCH_SYNTHESIS_MAX_CHARS_PER_SOURCE` | `2500` | Max chars per source. Config: `deep_research.synthesis_max_chars_per_source` |
| `DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS` | `1024` | Max tokens in the LLM response. Tune per model: 512-1024 for small 4k-ctx models (e.g. lfm2-2.6b), 2048+ for large models. Config: `deep_research.synthesis_max_tokens` |
## Comparison
| Feature | Firecrawl / Jina / Tavily | ShadowCrawl v3.1.0 |
|---|---|---|
| Deep Research | None / paid add-on | Native: multi-hop + LLM synthesis (local or cloud) |
| Cost | $49-$499/mo | $0 - self-hosted |
| Privacy | They see your queries | 100% private, local-only |
| Search Engine | Proprietary / 3rd-party API | Native Rust (4 engines, parallel) |
| Result Quality | Mixed, noisy snippets | Deduped, scored, LLM-clean |
| Cloudflare Bypass | Rarely | Native Chromium CDP + HITL fallback |
| LinkedIn / Airbnb | Blocked | 99.99% success (HITL) |
| JS Rendering | Cloud API | Native Brave + bundled Chromium CDP |
| Semantic Memory | None | Embedded LanceDB + Model2Vec |
| Proxy Support | Paid add-on | Native SOCKS5/HTTP rotation |
| MCP Native | Partial | Full MCP stdio + HTTP |
## Agent Optimal Setup: IDE Copilot Instructions
ShadowCrawl works best when your AI agent knows the operational rules before it starts: which tool to call first, when to rotate proxies, and when not to use `extract_structured`. Without these rules, agents waste tokens re-fetching cached data and can misuse tools on incompatible sources.
The complete rules file lives at .github/copilot-instructions.md (VS Code / GitHub Copilot) and is also available as .clinerules for Cline. Copy the block below into the IDE-specific file for your editor.
### VS Code - .github/copilot-instructions.md
Create (or append to) .github/copilot-instructions.md in your workspace root:
```markdown
## MCP Usage Guidelines - ShadowCrawl

### Shadowcrawl Priority Rules

1. **Memory first (NEVER skip):** ALWAYS call `research_history` BEFORE calling `search_web`,
   `search_structured`, or `scrape_url`.
   **Cache-quality guard:** only skip a live fetch when ALL of the following are true:
   - similarity score ≥ 0.60
   - entry_type is NOT "search" (search entries have no word_count - always follow up with scrape_url)
   - word_count ≥ 50 (cached crates.io pages are JS-placeholders with ~11 words)
   - no placeholder/sparse warnings (placeholder_page, short_content, content_restricted)
2. **Initial research:** use `search_structured` (search + content summaries in one call).
   For private/internal tools not indexed publicly, skip search and go directly to
   `scrape_url` on the known repo/docs URL.
3. **Doc/article pages:** `scrape_url` with `output_format: clean_json`,
   `strict_relevance: true`, `query: "<your question>"`.
   Raw `.md`/`.txt` URLs are auto-detected → HTML pipeline is skipped, raw content returned.
4. **Proxy rotation (mandatory on first block):** if `scrape_url` or `search_web` returns
   403/429/rate-limit, immediately call `proxy_manager` with `action: "grab"` then retry
   with `use_proxy: true`. Do NOT wait for a second failure.
4a. **Auto-escalation on low confidence:** if `scrape_url` returns confidence < 0.3 or
    extraction_score < 0.4 → retry with `quality_mode: "aggressive"` → `visual_scout`
    → `human_auth_session`. Never stay stuck on a low-confidence result.
5. **Schema extraction:** use `fetch_then_extract` (one-shot) or `extract_structured`.
   Both auto-inject a `raw_markdown_url` warning when called on raw file URLs.
   Do NOT point at raw `.md`/`.json`/`.txt` unless intentional.
6. **Sub-page discovery:** use `crawl_website` before `scrape_url` when you only know
   an index URL and need to find the right sub-page.
7. **Last resort:** `non_robot_search` only after direct fetch + proxy rotation have both
   failed (Cloudflare / CAPTCHA / login walls). Session cookies are persisted after login.
```
### Cursor - .cursorrules
Create or append to .cursorrules in your project root with the same block above.
### Cline (VS Code extension) - .clinerules
Already included in this repository as .clinerules. Cline loads it automatically - no action needed.
### Claude Desktop - System Prompt / Custom Instructions
Paste the rules block into the Custom Instructions or System Prompt field in Claude Desktop settings (Settings → Advanced → System Prompt).
### Other Agents (Windsurf, Aider, Continue, AutoGen, etc.)
Any agent that accepts a system prompt or workspace instruction file: paste the same block. The rules are plain markdown and tool-agnostic.
### Quick Decision Flow
```
Question / research task
        │
        ▼
research_history ──► hit (≥ 0.60)? ──► cache-quality guard:
        │ miss              │ entry_type == "search"? ──► don't skip; do scrape_url
        │                   │ word_count < 50 or placeholder warnings? ──► don't skip
        │                   └──► quality OK? ──► use cached result, STOP
        ▼
search_structured ──► enough content? ──► use it, STOP
        │ need deeper page
        ▼
scrape_url (clean_json + strict_relevance + query)
        │ confidence < 0.3 or extraction_score < 0.4?
        └──► retry quality_mode: aggressive ──► visual_scout ──► human_auth_session
        │ 403/429/blocked? ──► proxy_manager grab ──► retry use_proxy: true
        │ still blocked? ──► non_robot_search (LAST RESORT)
        │
        └── need schema JSON? ──► fetch_then_extract (schema + strict=true)
```
Full rules + per-tool quick-reference table: .github/copilot-instructions.md
## v3.0.0 (2026-02-20)
### Added
- `human_auth_session` (The Nuclear Option): launches a visible browser for human login/CAPTCHA solving. Captures and persists full authentication cookies to `~/.shadowcrawl/sessions/{domain}.json`. Enables full automation for protected URLs after a single manual session.
- Instruction Overlay: `human_auth_session` now displays a custom green "ShadowCrawl" instruction banner on top of the browser window to guide users through complex auth walls.
- Persistent Session Auto-Injection: `web_fetch`, `web_crawl`, and `visual_scout` now automatically check for and inject matching cookies from the local session store.
- `extract_structured` / `fetch_then_extract`: new optional params `placeholder_word_threshold` (int, default 10) and `placeholder_empty_ratio` (float 0-1, default 0.9) allow agents to tune placeholder-detection sensitivity per call.
- `web_crawl`: new optional `max_chars` param (default 10 000) caps total JSON output size to prevent workspace storage spill.
- Rustdoc module extraction: `extract_structured` / `fetch_then_extract` correctly populate `modules: [...]` on docs.rs pages using the `NAME/index.html` sub-directory convention.
- GitHub Discussions & Issues hydration: `fetch_via_cdp` detects `github.com/*/discussions/*` and `/issues/*` URLs; extends the network-idle window to 2.5 s / 12 s max and polls for `.timeline-comment`, `.js-discussion`, `.comment-body` DOM nodes.
- Contextual code blocks (`clean_json` mode): `SniperCodeBlock` gains a `context: Option<String>` field. Performs two-pass extraction for prose preceding fenced blocks and Markdown sentences containing inline snippets.
- IDE copilot-instructions guide (README): new "Agent Optimal Setup" section.
- `.clinerules` workspace file: all 7 priority rules + decision-flow diagram + per-tool quick-reference table.
- Agent priority rules in tool schemas: every MCP tool description now carries a machine-readable "AGENT RULE" / "BEST PRACTICE" marker.
### Changed
- Placeholder detection (scalar-only logic): the confidence override to 0.0 now only considers scalar (non-array) fields. Pure-array schemas (headers, modules, structs) never trigger fake placeholder warnings, fixing false positives on rich but list-heavy documentation pages.
- `web_fetch` (`output_format="clean_json"`): applies a `max_chars`-based paragraph budget and emits `clean_json_truncated` when output is clipped.
- `extract_fields` / `fetch_then_extract`: placeholder/unrendered pages (very low content + mostly empty schema fields) force `confidence=0.0`.
- Short-content bypass (`strict_relevance` / `extract_relevant_sections`): early exit with a descriptive warning when `word_count < 200`. Short pages (GitHub Discussions, Q&A threads) are returned whole.
### Fixed
- BUG-6: `modules: []` was always empty on rustdoc pages; refactored the regex to support both absolute and simple relative module links (`init/index.html`, `optim/index.html`).
- BUG-7: false-positive `confidence=0.0` on real docs.rs pages; replaced the whole-schema empty ratio with a scalar-only ratio and raised the threshold.
- BUG-9: `web_crawl` could spill 16 KB+ of JSON into VS Code workspace storage; the handler now truncates the response to `max_chars` (default 10 000).
- `web_fetch` (`output_format="clean_json"`): the paragraph filter now adapts for `word_count < 200`.
- `fetch_then_extract`: prevents false-high confidence on JS-only placeholder pages (e.g. crates.io) by overriding confidence to 0.0.
- `cdp_fallback_failed` on GitHub Discussions: the extended CDP hydration window and selector polling ensure full thread capture.
## Acknowledgments & Support
ShadowCrawl is built with ❤️ by a solo developer for the open-source AI community. If this tool saved you from a $500/mo scraping API bill:
- Star the repo - it helps others discover this
- Found a bug? Open an issue
- Feature request? Start a discussion
- Fuel more updates:
License: MIT - free for personal and commercial use.