CRW Web Scraper

Open-source web scraper for AI agents with scrape, crawl, and map tools


fastCRW

The web scraper built for AI agents. Single binary. Zero config.


Works with: Claude Code · Cursor · Windsurf · Cline · Copilot · Continue.dev · Codex

MCP Integration · Installation · API Reference · Cloud · JS Rendering · Configuration · Discord

English | 中文


Don't want to self-host? fastcrw.com is the managed cloud — global proxy network, auto-scaling, dashboard, and API keys. Same Firecrawl-compatible API. Get 500 free credits →

CRW is the open-source web scraper built for AI agents. Built-in MCP server (stdio + HTTP), single binary, ~6 MB idle RAM. Give Claude Code, Cursor, or any MCP client web scraping superpowers in 30 seconds. Firecrawl-compatible API — 5.5x faster, 75x less memory, 92% coverage on 1K real-world URLs.

Built-in MCP server. Single binary. No Redis. No Node.js.

# With npm (no Rust required):
npx crw-mcp

# Or with Cargo:
cargo install crw-mcp

# Add to Claude Code:
claude mcp add crw -- npx crw-mcp

Listed on the MCP Registry

What's New

0.2.0 (2026-03-28)

Features

  • add MCP Registry support for official server discovery (154b9f5)

0.1.2 (2026-03-27)

Bug Fixes

  • vendor pdf-inspector as crw-pdf for crates.io publishability (3f7681d)

0.1.1 (2026-03-26)

Bug Fixes

  • skip already-published crates without masking real errors (010649c)

Full changelog →

Why CRW?

CRW gives you Firecrawl's API with a fraction of the resource usage. No runtime dependencies, no Redis, no Node.js — just a single binary you can deploy anywhere.

Metric | CRW (self-hosted) | fastcrw.com (cloud) | Firecrawl | Crawl4AI | Spider
Coverage (1K URLs) | 92.0% | 92.0% | 77.2% | 99.9% | —
Avg Latency | 833ms | 833ms | 4,600ms | — | —
P50 Latency | 446ms | 446ms | — | — | 45ms (static)
Noise Rejection | 88.4% | 88.4% | noise 6.8% | noise 11.3% | noise 4.2%
Idle RAM | 6.6 MB | 0 (managed) | ~500 MB+ | — | cloud-only
Cold start | 85 ms | 0 (always-on) | 30–60 s | — | —
HTTP scrape | ~30 ms | ~30 ms | ~200 ms+ | ~480 ms | ~45 ms
Proxy network | BYO | Global (built-in) | Built-in | — | Cloud-only
Cost / 1K scrapes | $0 (self-hosted) | From $13/mo | $0.83–5.33 | $0 | $0.65
Dependencies | single binary | None (API) | Node + Redis + PG + RabbitMQ | Python + Playwright | Rust / cloud
License | AGPL-3.0 | Managed | AGPL-3.0 | Apache-2.0 | MIT
Full benchmark details

CRW vs Firecrawl — Tested on Firecrawl scrape-content-dataset-v1 (1,000 real-world URLs, JS rendering enabled):

  • CRW covers 92% of URLs vs Firecrawl's 77.2% — 15 percentage points higher
  • CRW is 5.5x faster on average (833ms vs 4,600ms)
  • CRW uses ~75x less RAM at idle (6.6 MB vs ~500 MB+)
  • Firecrawl requires 5 containers (Node.js, Redis, PostgreSQL, RabbitMQ, Playwright) — CRW is a single binary

Crawl4AI vs Firecrawl vs Spider (independent benchmark by Spider.cloud):

Metric | Spider | Firecrawl | Crawl4AI
Static throughput | 182 p/s | 27 p/s | 19 p/s
Success (static) | 100% | 99.5% | 99%
Success (SPA) | 100% | 96.6% | 93.7%
Success (anti-bot) | 99.6% | 88.4% | 72%
Latency (static) | 45ms | 310ms | 480ms
Latency (SPA) | 820ms | 1,400ms | 1,650ms

Firecrawl independent review (Scrapeway benchmark): 64.3% success rate, $5.11/1K scrapes, 0% on LinkedIn/Twitter.

Resource comparison:

Metric | CRW | Firecrawl
Min RAM | ~7 MB | 4 GB
Recommended RAM | ~64 MB (under load) | 8–16 GB
Docker images | single ~8 MB binary | ~2–3 GB total
Cold start | 85 ms | 30–60 seconds
Containers needed | 1 (+optional sidecar) | 5

Features

  • 🔧 MCP server — built-in stdio + HTTP transport for Claude Code, Cursor, Windsurf, and any MCP client
  • 🔌 Firecrawl-compatible API — same endpoint family and familiar request/response ergonomics
  • 📄 6 output formats — markdown, HTML, cleaned HTML, raw HTML, plain text, links, structured JSON
  • 🤖 LLM structured extraction — send a JSON schema, get validated structured data back (Anthropic tool_use + OpenAI function calling)
  • 🌐 JS rendering — auto-detect SPAs with shell heuristics, render via LightPanda, Playwright, or Chrome (CDP)
  • 🕷️ BFS crawler — async crawl with rate limiting, robots.txt, sitemap support, concurrent jobs
  • 🔒 Security — SSRF protection (private IPs, cloud metadata, IPv6), constant-time auth, dangerous URI filtering
  • 🐳 Docker ready — multi-stage build with LightPanda sidecar
  • 🎯 CSS selector & XPath — extract specific DOM elements before Markdown conversion
  • ✂️ Chunking & filtering — split content into topic/sentence/regex chunks; rank by BM25 or cosine similarity
  • 🕵️ Stealth mode — browser-like UA rotation and header injection to reduce bot detection
  • 🌐 Per-request proxy — override the global proxy per scrape request

Cloud vs Self-Hosted

Feature | Self-hosted | Cloud (fastcrw.com)
Setup | cargo install crw-server | Sign up → get API key
Infrastructure | You manage | Fully managed
Proxy | Bring your own | Global proxy network
Scaling | Manual | Auto-scaling
API | Firecrawl-compatible | Same Firecrawl-compatible API

Both use the same Firecrawl-compatible API — your code works with either. Switch between self-hosted and cloud by changing the base URL.
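
Below is a minimal sketch of that switch, assuming the only differences are the base URL and the Authorization header; the cloud path (https://fastcrw.com/api) and Bearer auth follow the cloud curl example in Quick Start, and the helper itself is just illustrative wiring.

import requests

def scrape(url: str, base_url: str = "http://localhost:3000", api_key: str | None = None) -> dict:
    """Scrape a URL via CRW; point base_url at a self-hosted server or fastcrw.com."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    resp = requests.post(f"{base_url}/v1/scrape", json={"url": url}, headers=headers, timeout=60)
    resp.raise_for_status()
    return resp.json()["data"]

# Self-hosted
print(scrape("https://example.com")["markdown"][:100])

# Cloud (note the /api prefix used by fastcrw.com)
print(scrape("https://example.com", base_url="https://fastcrw.com/api", api_key="YOUR_API_KEY")["markdown"][:100])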

Quick Start

MCP (AI agents — recommended):

cargo install crw-mcp
claude mcp add crw -- crw-mcp

That's it. Claude Code now has the crw_scrape, crw_crawl, crw_check_crawl_status, and crw_map tools. For Cursor, Windsurf, Cline, and other MCP clients, see MCP Server.

CLI (no server needed):

cargo install crw-cli
crw https://example.com

Self-hosted server:

cargo install crw-server
crw-server

Enable JS rendering (optional):

crw-server setup

This downloads LightPanda and creates a config.local.toml for JS rendering. See JS Rendering for details.

Cloud (no setup):

curl -X POST https://fastcrw.com/api/v1/scrape \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Get your API key at fastcrw.com — 500 free credits included.

Docker:

docker run -p 3000:3000 ghcr.io/us/crw:latest

Docker Compose (with JS rendering):

docker compose up

Scrape a page:

curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
{
  "success": true,
  "data": {
    "markdown": "# Example Domain\nThis domain is for use in ...",
    "metadata": {
      "title": "Example Domain",
      "sourceURL": "https://example.com",
      "statusCode": 200,
      "elapsedMs": 32
    }
  }
}

Use Cases

  • RAG pipelines — crawl websites and extract structured data for vector databases (see the sketch after this list)
  • AI agents — give Claude Code or Claude Desktop web scraping tools via MCP
  • Content monitoring — periodic crawl with LLM extraction to track changes
  • Data extraction — combine CSS selectors + LLM to extract any schema from any page
  • Web archiving — full-site BFS crawl to markdown
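
As a rough illustration of the RAG and web-archiving use cases, the sketch below maps a site and then scrapes each discovered URL to markdown. It assumes a self-hosted server on localhost:3000 and that the /v1/map response lists URLs under a "links" key (as in Firecrawl's API); confirm the field name against the API reference before relying on it.

import requests

BASE = "http://localhost:3000"

# Discover URLs on the site (the "links" field name is an assumption, per Firecrawl's /v1/map)
site_map = requests.post(f"{BASE}/v1/map", json={"url": "https://example.com"}, timeout=60).json()
urls = site_map.get("links", [])[:20]  # cap the example at 20 pages

# Scrape each page to markdown, ready for chunking and embedding in a RAG pipeline
pages = []
for url in urls:
    resp = requests.post(f"{BASE}/v1/scrape", json={"url": url, "formats": ["markdown"]}, timeout=120)
    if resp.ok and resp.json().get("success"):
        data = resp.json()["data"]
        pages.append({"url": data["metadata"]["sourceURL"], "markdown": data["markdown"]})

print(f"Archived {len(pages)} pages")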

API Endpoints

Method | Endpoint | Description
POST | /v1/scrape | Scrape a single URL, optionally with LLM extraction
POST | /v1/crawl | Start async BFS crawl (returns job ID)
GET | /v1/crawl/:id | Check crawl status and retrieve results
DELETE | /v1/crawl/:id | Cancel a running crawl job
POST | /v1/map | Discover all URLs on a site
GET | /health | Health check (no auth required)
POST | /mcp | Streamable HTTP MCP transport
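
A sketch of the async crawl flow above: start a job, then poll its status until it finishes. The request options and response field names ("id", "status", "data") are assumptions based on the Firecrawl-compatible shape rather than confirmed CRW fields.

import time
import requests

BASE = "http://localhost:3000"

# Start an async BFS crawl (job ID assumed to be returned as "id")
job = requests.post(f"{BASE}/v1/crawl", json={"url": "https://example.com"}, timeout=60).json()
job_id = job["id"]

# Poll the status endpoint until the job reaches a terminal state
while True:
    status = requests.get(f"{BASE}/v1/crawl/{job_id}", timeout=60).json()
    if status.get("status") in ("completed", "failed", "cancelled"):
        break
    time.sleep(2)

# Each result is assumed to carry the same metadata shape as /v1/scrape
for page in status.get("data", []):
    print(page.get("metadata", {}).get("sourceURL"))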

LLM Structured Extraction

Send a JSON schema with your scrape request and CRW returns validated structured data using LLM function calling.

curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "formats": ["json"],
    "jsonSchema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "number" }
      },
      "required": ["name", "price"]
    }
  }'
  • Anthropic — uses tool_use with input_schema for extraction
  • OpenAI — uses function calling with parameters schema
  • Validation — LLM output is validated against your JSON schema before returning
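
The same request from Python, reading the structured result back out. The response key for the extracted object ("json" below, mirroring the requested format) is an assumption from the Firecrawl-compatible shape.

import requests

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "price": {"type": "number"}},
    "required": ["name", "price"],
}

resp = requests.post(
    "http://localhost:3000/v1/scrape",
    json={"url": "https://example.com/product", "formats": ["json"], "jsonSchema": schema},
    timeout=120,
)
resp.raise_for_status()
data = resp.json()["data"]

# Extracted object assumed to be returned under "json", matching the requested format
product = data.get("json", {})
print(product.get("name"), product.get("price"))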

Configure the LLM provider in your config:

[extraction.llm]
provider = "anthropic"        # "anthropic" or "openai"
api_key = "sk-..."            # or CRW_EXTRACTION__LLM__API_KEY env var
model = "claude-sonnet-4-20250514"

MCP Server

CRW works as an MCP tool server for any AI assistant that supports MCP. It provides 4 tools: crw_scrape, crw_crawl, crw_check_crawl_status, crw_map.

Also available on the MCP Registry

Install the MCP stdio binary:

# With npm (no Rust required):
npx crw-mcp

# Or with Cargo:
cargo install crw-mcp

Claude Code

claude mcp add --transport http crw http://localhost:3000/mcp

Claude Desktop

Edit your config file:

OS | Path
macOS | ~/Library/Application Support/Claude/claude_desktop_config.json
Windows | %APPDATA%\Claude\claude_desktop_config.json
Linux | ~/.config/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" }
    }
  }
}

Cursor

Edit ~/.cursor/mcp.json (global) or .cursor/mcp.json (project):

{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" }
    }
  }
}

Windsurf

Edit ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" }
    }
  }
}

Cline (VS Code)

{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" },
      "alwaysAllow": ["crw_scrape", "crw_map"],
      "disabled": false
    }
  }
}

Continue.dev (VS Code / JetBrains)

Edit ~/.continue/config.yaml:

mcpServers:
  - name: crw
    command: /absolute/path/to/crw-mcp
    env:
      CRW_API_URL: http://localhost:3000

OpenAI Codex CLI

Edit ~/.codex/config.toml:

[mcp_servers.crw]
command = "/absolute/path/to/crw-mcp"

[mcp_servers.crw.env]
CRW_API_URL = "http://localhost:3000"

Other MCP Clients

Any MCP-compatible client can connect to CRW using the standard JSON format:

{
  "mcpServers": {
    "crw": {
      "command": "/absolute/path/to/crw-mcp",
      "env": { "CRW_API_URL": "http://localhost:3000" }
    }
  }
}

Tip: The stdio binary (crw-mcp) works with any client. For clients that support HTTP transport, use http://localhost:3000/mcp directly — no binary needed.

See the full MCP setup guide for detailed instructions, auth configuration, and platform comparison.
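
For a quick programmatic smoke test of the HTTP transport, here is a sketch using the official MCP Python SDK (the mcp package). The tool names come from the list above; the "url" argument name and the localhost endpoint are assumptions, so inspect the tool schemas returned by list_tools before relying on them.

import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main() -> None:
    # Connect to CRW's streamable HTTP MCP endpoint (assumes crw-server on localhost:3000)
    async with streamablehttp_client("http://localhost:3000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])  # expect crw_scrape, crw_crawl, ...

            # Argument name "url" is an assumption; check the tool schema printed above
            result = await session.call_tool("crw_scrape", {"url": "https://example.com"})
            print(result.content[0].text[:200])

asyncio.run(main())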

JS Rendering

CRW auto-detects SPAs by analyzing the initial HTML response for shell heuristics (empty body, framework markers). When a SPA is detected, it renders the page via a headless browser.

Quick setup (recommended):

crw-server setup

This automatically downloads the LightPanda binary to ~/.local/bin/ and creates a config.local.toml with the correct renderer settings. Then start LightPanda and CRW:

lightpanda serve --host 127.0.0.1 --port 9222 &
crw-server

Supported renderers:

Renderer | Protocol | Best for
LightPanda | CDP over WebSocket | Low-resource environments (default)
Playwright | CDP over WebSocket | Full browser compatibility
Chrome | CDP over WebSocket | Existing Chrome infrastructure

Renderer mode is configured via renderer.mode: auto (default), lightpanda, playwright, chrome, or none.

With Docker Compose, LightPanda runs as a sidecar — no extra setup needed:

docker compose up

Architecture

┌─────────────────────────────────────────────┐
│                 crw-server                  │
│         Axum HTTP API + Auth + MCP          │
├──────────┬───────────┬──────────────────────┤
│ crw-crawl│crw-extract│    crw-renderer      │
│ BFS crawl│ HTML→MD   │  HTTP + CDP(WS)      │
│ robots   │ LLM/JSON  │  LightPanda/Chrome   │
│ sitemap  │ clean/read│  auto-detect SPA     │
├──────────┴───────────┴──────────────────────┤
│                 crw-core                    │
│        Types, Config, Errors                │
└─────────────────────────────────────────────┘

Configuration

CRW uses layered TOML configuration with environment variable overrides:

  1. config.default.toml — built-in defaults
  2. config.local.toml — local overrides (or set CRW_CONFIG=myconfig)
  3. Environment variables — CRW_ prefix with __ separator (e.g. CRW_SERVER__PORT=8080)
[server]
host = "0.0.0.0"
port = 3000
rate_limit_rps = 10        # Max requests/second (global). 0 = unlimited.

[renderer]
mode = "auto"  # auto | lightpanda | playwright | chrome | none

[crawler]
max_concurrency = 10
requests_per_second = 10.0
respect_robots_txt = true

[auth]
# api_keys = ["fc-key-1234"]

See full configuration reference for all options.

Integration Examples

Python:

import requests

response = requests.post("http://localhost:3000/v1/scrape", json={
    "url": "https://example.com",
    "formats": ["markdown", "links"]
})
data = response.json()["data"]
print(data["markdown"])

Node.js:

const response = await fetch("http://localhost:3000/v1/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://example.com",
    formats: ["markdown", "links"]
  })
});
const { data } = await response.json();
console.log(data.markdown);

CrewAI (crewai-crw — CRW tools for CrewAI):

pip install crewai-crw
from crewai import Agent, Task, Crew
from crewai_crw import CrwScrapeWebsiteTool, CrwCrawlWebsiteTool, CrwMapWebsiteTool

scrape_tool = CrwScrapeWebsiteTool()  # uses localhost:3000 by default

researcher = Agent(
    role="Web Researcher",
    goal="Research and summarize information from websites",
    backstory="Expert at extracting key information from web pages",
    tools=[scrape_tool],
)

task = Task(
    description="Scrape https://example.com and summarize the content",
    expected_output="A summary of the page content",
    agent=researcher,
)

crew = Crew(agents=[researcher], tasks=[task])
result = crew.kickoff()

LangChain (langchain-crw — CRW document loader):

pip install langchain-crw
from langchain_crw import CrwLoader

# Self-hosted (default: localhost:3000)
loader = CrwLoader(url="https://example.com", mode="scrape")
docs = loader.load()
print(docs[0].page_content)  # clean markdown

# Cloud (fastcrw.com)
loader = CrwLoader(
    url="https://example.com",
    api_url="https://fastcrw.com/api",
    api_key="crw_live_...",
    mode="crawl",
    params={"max_depth": 3, "max_pages": 50},
)
docs = loader.load()

More integrations (PRs pending): Flowise · Agno · n8n · All integrations

Docker

Pre-built image from GHCR:

docker pull ghcr.io/us/crw:latest
docker run -p 3000:3000 ghcr.io/us/crw:latest

Docker Compose (with JS rendering sidecar):

docker compose up

This starts CRW on port 3000 with LightPanda as a JS rendering sidecar on port 9222. CRW auto-connects to LightPanda for SPA rendering.

Benchmark

Tested on Firecrawl's scrape-content-dataset-v1 (1,000 real-world URLs, JS rendering enabled):

Metric | CRW | Firecrawl v2.5 | Crawl4AI | Spider
Coverage | 92.0% | 77.2% | 99.9% | —
Avg Latency | 833ms | 4,600ms | — | —
P50 Latency | 446ms | — | — | 45ms (static)
Noise Rejection | 88.4% | noise 6.8% | noise 11.3% | noise 4.2%
Cost / 1,000 scrapes | $0 (self-hosted) | $0.83–5.33 | $0 | $0.65
Idle RAM | 6.6 MB | ~500 MB+ | — | cloud

CRW hits a sweet spot: near-Spider speed with Firecrawl API compatibility and ~75x less RAM. Unlike Crawl4AI (Python + Playwright), CRW ships as a single Rust binary with no runtime dependencies.

Run the benchmark yourself:

uv pip install datasets aiohttp
uv run python bench/run_bench.py

Crates

Crate | Description
crw-core | Core types, config, and error handling
crw-renderer | HTTP + CDP browser rendering engine
crw-extract | HTML → markdown/plaintext extraction
crw-crawl | Async BFS crawler with robots.txt & sitemap
crw-server | Axum API server (Firecrawl-compatible)
crw-cli | Standalone CLI (crw binary, no server needed)
crw-mcp | MCP stdio proxy binary

See docs/crates.md for usage examples and cargo add instructions.

Documentation

Full documentation: docs/index.md

Security

CRW includes built-in protections against common web scraping attack vectors:

  • SSRF protection — all URL inputs (REST API + MCP) are validated against private/internal networks:
    • Loopback (127.0.0.0/8, ::1, localhost)
    • Private IPs (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
    • Link-local / cloud metadata (169.254.0.0/16 — blocks AWS/GCP metadata endpoints)
    • IPv6 mapped addresses (::ffff:127.0.0.1), link-local (fe80::), ULA (fc00::/7)
    • Non-HTTP schemes (file://, ftp://, gopher://, data:)
  • Auth — optional Bearer token with constant-time comparison (no length or key-index leakage)
  • robots.txt — respects Allow/Disallow with wildcard patterns (*, $) and RFC 9309 specificity
  • Rate limiting — configurable per-second request cap with token-bucket algorithm (returns 429 with error_code; see the retry sketch after this list)
  • Resource limits — max body size (1 MB), max crawl depth (10), max pages (1000), max discovered URLs (5000)
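
A small client-side sketch for the rate limiter mentioned above: back off and retry when the server answers 429. The backoff policy is an assumption (the docs only state that 429 responses carry an error_code), and honoring a Retry-After header is speculative.

import time
import requests

def scrape_with_retry(url: str, base_url: str = "http://localhost:3000", attempts: int = 5) -> dict:
    """POST /v1/scrape, backing off on 429 responses from the token-bucket rate limiter."""
    delay = 1.0
    for _ in range(attempts):
        resp = requests.post(f"{base_url}/v1/scrape", json={"url": url}, timeout=60)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()["data"]
        # Honor Retry-After if present (speculative), otherwise back off exponentially
        delay = float(resp.headers.get("Retry-After", delay))
        time.sleep(delay)
        delay = min(delay * 2, 30.0)
    raise RuntimeError(f"still rate-limited after {attempts} attempts")

print(scrape_with_retry("https://example.com")["metadata"]["statusCode"])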

Community

Don't be shy — join us on Discord! Ask questions, share what you're building, report bugs, or just hang out.

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

  1. Fork the repository
  2. Install pre-commit hooks: make hooks
  3. Create your feature branch (git checkout -b feat/my-feature)
  4. Commit your changes (git commit -m 'feat: add my feature')
  5. Push to the branch (git push origin feat/my-feature)
  6. Open a Pull Request

The pre-commit hook runs the same checks as CI (cargo fmt, cargo clippy, cargo test). You can also run them manually with make check.

License

CRW is open-source under AGPL-3.0. For a managed version without AGPL obligations, see fastcrw.com.
