crawl-mcp-server

A comprehensive MCP (Model Context Protocol) server providing 11 powerful tools for web crawling and search. Transform web content into clean, LLM-optimized Markdown or search the web with SearXNG integration.

✨ Features

  • 🔍 SearXNG Web Search - Search the web with automatic browser management
  • 📄 4 Crawling Tools - Extract and convert web content to Markdown
  • 🚀 Auto-Browser-Launch - Search tools automatically manage browser lifecycle
  • 📦 11 Total Tools - Complete toolkit for web interaction
  • 💾 Built-in Caching - SHA-256 based caching with graceful fallbacks
  • ⚡ Concurrent Processing - Process up to 50 URLs per batch concurrently
  • 🎯 LLM-Optimized Output - Clean Markdown perfect for AI consumption
  • 🛡️ Robust Error Handling - Graceful failure with detailed error messages
  • 🧪 Comprehensive Testing - Full CI/CD with performance benchmarks

📦 Installation

Method 1: npm (Recommended)

npm install crawl-mcp-server

Method 2: Direct from Git

# Install latest from GitHub
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

# Or specific branch
npm install git+https://github.com/Git-Fg/searchcrawl-mcp-server.git#main

# Or from a fork
npm install git+https://github.com/YOUR_FORK/searchcrawl-mcp-server.git

Method 3: Clone and Build

git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server
npm install
npm run build

Method 4: npx (No Installation)

# Run directly without installing
npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

🔧 Setup for Claude Desktop

Option 1: npx (Recommended)

Add to your Claude Desktop configuration file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "crawl-server": {
      "command": "npx",
      "args": [
        "git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
      ],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}

Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "crawl-server": {
      "command": "npx",
      "args": [
        "git+https://github.com/Git-Fg/searchcrawl-mcp-server.git"
      ],
      "env": {
        "NODE_ENV": "production"
      }
    }
  }
}

Option 2: Local Installation

If you've installed locally:

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": [
        "/path/to/crawl-mcp-server/dist/index.js"
      ],
      "env": {}
    }
  }
}

Option 3: Custom Path

For a specific install location, such as a global npm install:

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": [
        "/usr/local/lib/node_modules/crawl-mcp-server/dist/index.js"
      ],
      "env": {}
    }
  }
}

After configuration, restart Claude Desktop.

🔧 Setup for Other MCP Clients

Claude CLI

# Using npx
claude mcp add crawl-server npx git+https://github.com/Git-Fg/searchcrawl-mcp-server.git

# Using local installation
claude mcp add crawl-server node /path/to/crawl-mcp-server/dist/index.js

Zed Editor

Add to ~/.config/zed/settings.json:

{
  "assistant": {
    "mcp": {
      "servers": {
        "crawl-server": {
          "command": "node",
          "args": ["/path/to/crawl-mcp-server/dist/index.js"]
        }
      }
    }
  }
}

VSCode with Copilot Chat

{
  "mcpServers": {
    "crawl-server": {
      "command": "node",
      "args": ["/path/to/crawl-mcp-server/dist/index.js"]
    }
  }
}

🚀 Quick Start

Using MCP Inspector (Testing)

# Install MCP Inspector globally
npm install -g @modelcontextprotocol/inspector

# Run the server
node dist/index.js

# In another terminal, test tools
npx @modelcontextprotocol/inspector --cli node dist/index.js --method tools/list

Development Mode

# Watch mode (auto-rebuild on changes)
npm run dev

# Build TypeScript
npm run build

# Run tests
npm run test:run

📚 Available Tools

Search & Browser Tools (7 tools)

1. search_searx

Search the web using SearXNG with automatic browser management.

// Example call
{
  "query": "TypeScript MCP server",
  "maxResults": 10,
  "category": "general",
  "timeRange": "week",
  "language": "en"
}

Parameters:

  • query (string, required): Search query
  • maxResults (number, default: 20): Results to return (1-50)
  • category (enum, default: general): one of general, images, videos, news, map, music, it, science
  • timeRange (enum, optional): one of day, week, month, year
  • language (string, default: en): Language code

Returns: JSON with search results array, URLs, and metadata
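
These parameters map naturally onto a Zod schema (the project uses Zod for validation; see Technology Stack below). The following is an illustrative sketch, not the server's actual definition:

import { z } from "zod";

// Illustrative schema mirroring the documented search_searx parameters.
const SearchSearxInput = z.object({
  query: z.string().min(1),
  maxResults: z.number().int().min(1).max(50).default(20),
  category: z
    .enum(["general", "images", "videos", "news", "map", "music", "it", "science"])
    .default("general"),
  timeRange: z.enum(["day", "week", "month", "year"]).optional(),
  language: z.string().default("en"),
});

// Example: validate a raw tool-call payload before handling it.
const args = SearchSearxInput.parse({ query: "TypeScript MCP server", maxResults: 10 });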


2. launch_chrome_cdp

Launch system Chrome with remote debugging for advanced SearX usage.

{
  "headless": true,
  "port": 9222,
  "userDataDir": "/path/to/profile"
}

Parameters:

  • headless (boolean, default: true): Run Chrome headless
  • port (number, default: 9222): Remote debugging port
  • userDataDir (string, optional): Custom Chrome profile

3. connect_cdp

Connect to remote CDP browser (Browserbase, etc.).

{
  "cdpWsUrl": "http://localhost:9222"
}

Parameters:

  • cdpWsUrl (string, required): CDP WebSocket URL or HTTP endpoint

4. launch_local

Launch bundled Chromium for SearX search.

{
  "headless": true,
  "userAgent": "custom user agent string"
}

Parameters:

  • headless (boolean, default: true): Run headless
  • userAgent (string, optional): Custom user agent

5. chrome_status

Check Chrome CDP status and health.

{}

Returns: Running status, health, endpoint URL, and PID


6. close

Close browser session (keeps Chrome CDP running).

{}

7. shutdown_chrome_cdp

Shutdown Chrome CDP and cleanup resources.

{}

Crawling Tools (4 tools)

1. crawl_read ⭐ (Simple & Fast)

Quick single-page extraction to Markdown.

{
  "url": "https://example.com/article",
  "options": {
    "timeout": 30000
  }
}

Best for:

  • ✅ News articles
  • ✅ Blog posts
  • ✅ Documentation pages
  • ✅ Simple content extraction

Returns: Clean Markdown content


2. crawl_read_batch ⭐ (Multiple URLs)

Process 1-50 URLs concurrently.

{
  "urls": [
    "https://example.com/article1",
    "https://example.com/article2",
    "https://example.com/article3"
  ],
  "options": {
    "maxConcurrency": 5,
    "timeout": 30000,
    "maxResults": 10
  }
}

Best for:

  • ✅ Processing multiple articles
  • ✅ Building content aggregates
  • ✅ Bulk content extraction

Returns: Array of Markdown results with summary statistics
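
Under the hood, batch crawling is a bounded-concurrency problem: keep at most maxConcurrency requests in flight and let individual failures surface in the results instead of aborting the batch. A minimal TypeScript sketch of the technique (fetchToMarkdown is a hypothetical single-URL helper, not the server's implementation):

// Process URLs with at most `maxConcurrency` requests in flight at once.
async function crawlBatch(
  urls: string[],
  fetchToMarkdown: (url: string) => Promise<string>, // hypothetical single-URL crawler
  maxConcurrency = 5
): Promise<Array<{ url: string; markdown?: string; error?: string }>> {
  const results: Array<{ url: string; markdown?: string; error?: string }> = [];
  let next = 0;

  async function worker(): Promise<void> {
    while (next < urls.length) {
      const url = urls[next++];
      try {
        results.push({ url, markdown: await fetchToMarkdown(url) });
      } catch (err) {
        // Partial failures do not abort the batch (see Error Handling below).
        results.push({ url, error: String(err) });
      }
    }
  }

  await Promise.all(Array.from({ length: Math.min(maxConcurrency, urls.length) }, worker));
  return results;
}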


3. crawl_fetch_markdown

Controlled single-page extraction with full option control.

{
  "url": "https://example.com/article",
  "options": {
    "timeout": 30000
  }
}

Best for:

  • ✅ Advanced crawling options
  • ✅ Custom timeout control
  • ✅ Detailed extraction

4. crawl_fetch

Multi-page crawling with intelligent link extraction.

{
  "url": "https://example.com",
  "options": {
    "pages": 5,
    "maxConcurrency": 3,
    "sameOriginOnly": true,
    "timeout": 30000,
    "maxResults": 20
  }
}

Best for:

  • ✅ Crawling entire sites
  • ✅ Link-based discovery
  • ✅ Multi-page scraping

Features:

  • Extracts links from starting page
  • Crawls discovered pages
  • Concurrent processing
  • Same-origin filtering (configurable; see the sketch after this list)
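
Same-origin filtering reduces to comparing URL origins (scheme, host, and port). A minimal sketch of the check, not the server's actual code:

// Keep only links that share the starting page's origin.
function filterSameOrigin(startUrl: string, links: string[]): string[] {
  const origin = new URL(startUrl).origin;
  return links.filter((link) => {
    try {
      return new URL(link, startUrl).origin === origin; // also resolves relative links
    } catch {
      return false; // skip malformed URLs
    }
  });
}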

💡 Usage Examples

Example 1: Search + Crawl Workflow

// Step 1: Search for topics
{
  "tool": "search_searx",
  "arguments": {
    "query": "TypeScript best practices 2024",
    "maxResults": 5
  }
}

// Step 2: Extract URLs from results
// (Parse the search results to get URLs)

// Step 3: Crawl selected articles
{
  "tool": "crawl_read_batch",
  "arguments": {
    "urls": [
      "https://example.com/article1",
      "https://example.com/article2",
      "https://example.com/article3"
    ]
  }
}
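
Driven programmatically, the same three-step workflow might look like this with the MCP TypeScript SDK client. This is a hedged sketch: the server path, result parsing, and URL selection are illustrative and depend on your installation.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the server over STDIO (adjust the path to your build).
const transport = new StdioClientTransport({ command: "node", args: ["dist/index.js"] });
const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// Step 1: search for topics
const search = await client.callTool({
  name: "search_searx",
  arguments: { query: "TypeScript best practices 2024", maxResults: 5 },
});

// Step 2: pull URLs out of the search results (shape depends on the returned JSON)
// Step 3: crawl the selected articles
const articles = await client.callTool({
  name: "crawl_read_batch",
  arguments: { urls: ["https://example.com/article1", "https://example.com/article2"] },
});

await client.close();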

Example 2: Batch Content Extraction

{
  "tool": "crawl_read_batch",
  "arguments": {
    "urls": [
      "https://news.site/article1",
      "https://news.site/article2",
      "https://news.site/article3"
    ],
    "options": {
      "maxConcurrency": 10,
      "timeout": 30000,
      "maxResults": 3
    }
  }
}

Example 3: Site Crawling

{
  "tool": "crawl_fetch",
  "arguments": {
    "url": "https://docs.example.com",
    "options": {
      "pages": 10,
      "maxConcurrency": 5,
      "sameOriginOnly": true,
      "timeout": 30000,
      "maxResults": 10
    }
  }
}

🎯 Tool Selection Guide

Use Case           | Recommended Tool           | Complexity
Single article     | crawl_read                 | Simple
Multiple articles  | crawl_read_batch           | Simple
Advanced options   | crawl_fetch_markdown       | Medium
Site crawling      | crawl_fetch                | Complex
Web search         | search_searx               | Simple
Research workflow  | search_searx + crawl_read  | Medium

🏗️ Architecture

Core Components

┌─────────────────────────────────────────┐
│         crawl-mcp-server                │
├─────────────────────────────────────────┤
│                                          │
│  ┌──────────────────────────────┐      │
│  │     MCP Server Core         │      │
│  │  - 11 registered tools      │      │
│  │  - STDIO/HTTP transport    │      │
│  └──────────────────────────────┘      │
│              │                           │
│  ┌──────────────────────────────┐      │
│  │   @just-every/crawl         │      │
│  │  - HTML → Markdown          │      │
│  │  - Mozilla Readability       │      │
│  │  - Concurrent crawling      │      │
│  └──────────────────────────────┘      │
│              │                           │
│  ┌──────────────────────────────┐      │
│  │   Playwright (Browser)       │      │
│  │  - SearXNG integration       │      │
│  │  - Auto browser management   │      │
│  │  - Anti-detection            │      │
│  └──────────────────────────────┘      │
│                                          │
└─────────────────────────────────────────┘

Technology Stack

  • Runtime: Node.js 18+
  • Language: TypeScript 5.7
  • Framework: MCP SDK (@modelcontextprotocol/sdk)
  • Crawling: @just-every/crawl
  • Browser: Playwright Core
  • Validation: Zod
  • Transport: STDIO (local) + HTTP (remote)

Data Flow

Client Request
    ↓
MCP Protocol
    ↓
Tool Handler
    ↓
┌─────────────────────┐
│   Crawl/Search     │
│  @just-every/crawl │  →  HTML content
│   or SearXNG       │  →  Search results
└─────────────────────┘
    ↓
HTML → Markdown
    ↓
Result Formatting
    ↓
MCP Response
    ↓
Client
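
Each tool handler in this flow is a function registered with the MCP SDK and fed validated arguments. The sketch below shows roughly what that registration looks like; it is illustrative only — the real handlers live in src/index.ts, and htmlToMarkdown is a hypothetical stand-in for the @just-every/crawl pipeline.

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Hypothetical stand-in for the real HTML → Markdown pipeline.
async function htmlToMarkdown(url: string): Promise<string> {
  const res = await fetch(url);
  return `# ${url}\n\n${await res.text()}`; // real code would run Readability + Markdown conversion
}

const server = new McpServer({ name: "crawl-mcp-server", version: "1.0.0" });

server.registerTool(
  "crawl_read",
  {
    description: "Quick single-page extraction to Markdown",
    inputSchema: { url: z.string().url() },
  },
  async ({ url }) => ({
    content: [{ type: "text", text: await htmlToMarkdown(url) }],
  })
);

await server.connect(new StdioServerTransport());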

🧪 Testing

Run Test Suite

# All unit tests
npm run test:run

# Performance benchmarks
npm run test:performance

# Full CI suite
npm run test:ci

# Individual tool test
npx @modelcontextprotocol/inspector --cli node dist/index.js \
  --method tools/call \
  --tool-name crawl_read \
  --tool-arg url="https://example.com"

Test Coverage

  • ✅ All 11 tools tested
  • ✅ Error handling validated
  • ✅ Performance benchmarks
  • ✅ Integration workflows
  • ✅ Multi-Node support (Node 18, 20, 22)

CI/CD Pipeline

┌────────────────────────────────────┐
│        GitHub Actions              │
├────────────────────────────────────┤
│  1. Test (Matrix: Node 18,20,22) │
│  2. Integration Tests (PR only)    │
│  3. Performance Tests (main)       │
│  4. Security Scan                  │
│  5. Coverage Report                │
└────────────────────────────────────┘

🔧 Development

Prerequisites

  • Node.js 18 or higher
  • npm or yarn

Setup

# Clone the repository
git clone https://github.com/Git-Fg/searchcrawl-mcp-server.git
cd crawl-mcp-server

# Install dependencies
npm install

# Build TypeScript
npm run build

# Run in development mode (watch)
npm run dev

Development Commands

# Build project
npm run build

# Watch mode (auto-rebuild)
npm run dev

# Run tests
npm run test:run

# Lint code
npm run lint

# Type check
npm run typecheck

# Clean build artifacts
npm run clean

Project Structure

crawl-mcp-server/
├── src/
│   ├── index.ts          # Main server (11 tools)
│   ├── types.ts           # TypeScript interfaces
│   └── cdp.ts            # Chrome CDP manager
├── test/
│   ├── run-tests.ts       # Unit test suite
│   ├── performance.ts     # Performance tests
│   └── config.ts          # Test configuration
├── dist/                  # Compiled JavaScript
├── .github/workflows/      # CI/CD pipeline
└── package.json

📊 Performance

Benchmarks

Operation                  | Avg Duration | Max Memory
crawl_read                 | ~1500ms      | 32MB
crawl_read_batch (2 URLs)  | ~2500ms      | 64MB
search_searx               | ~4000ms      | 128MB
crawl_fetch                | ~2000ms      | 48MB
tools/list                 | ~100ms       | 8MB

Performance Features

  • ✅ Concurrent request processing (up to 20)
  • ✅ Built-in caching (SHA-256; see the sketch after this list)
  • ✅ Automatic timeout management
  • ✅ Memory optimization
  • ✅ Resource cleanup
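
The SHA-256 caching listed above boils down to hashing the request URL into a stable cache key, with a straightforward fallback to a live fetch on a miss. A minimal sketch of the technique (illustrative; the server's actual cache layout may differ):

import { createHash } from "node:crypto";

// Derive a stable cache key from a URL; identical URLs always map to the same key.
function cacheKey(url: string): string {
  return createHash("sha256").update(url).digest("hex");
}

// Graceful fallback: serve cached Markdown if present, otherwise fetch and store.
async function withCache(
  url: string,
  cache: Map<string, string>,
  fetchMarkdown: (url: string) => Promise<string> // hypothetical fetch-and-convert helper
): Promise<string> {
  const key = cacheKey(url);
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const markdown = await fetchMarkdown(url);
  cache.set(key, markdown);
  return markdown;
}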

🛡️ Error Handling

All tools include comprehensive error handling:

  • Network errors: Graceful degradation with error messages
  • Timeout handling: Configurable timeouts
  • Partial failures: Batch operations continue on individual failures
  • Structured errors: Clear error codes and messages
  • Recovery: Automatic retries where appropriate

Example error response:

{
  "content": [
    {
      "type": "text",
      "text": "Error: Failed to fetch https://example.com: Timeout after 30000ms"
    }
  ],
  "structuredContent": {
    "error": "Network timeout",
    "url": "https://example.com",
    "code": "TIMEOUT"
  }
}

🔐 Security

  • No API keys required for basic crawling
  • Respect robots.txt (configurable)
  • User agent rotation
  • Rate limiting (built-in via concurrency limits)
  • Input validation (Zod schemas)
  • Dependency scanning (npm audit, Snyk)

🌐 Transport Modes

STDIO (Default)

For local MCP clients:

node dist/index.js

HTTP

For remote access:

TRANSPORT=http PORT=3000 node dist/index.js

Server runs on: http://localhost:3000/mcp
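
Assuming the HTTP mode speaks the MCP Streamable HTTP transport, a client can connect with the SDK like this (a sketch; adjust host and port to your deployment):

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

const transport = new StreamableHTTPClientTransport(new URL("http://localhost:3000/mcp"));
const client = new Client({ name: "http-example-client", version: "1.0.0" });
await client.connect(transport);

const tools = await client.listTools(); // should report all 11 tools
console.log(tools.tools.map((t) => t.name));

await client.close();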

📝 Configuration

Environment Variables

# Transport mode (stdio or http)
TRANSPORT=stdio

# HTTP port (when TRANSPORT=http)
PORT=3000

# Node environment
NODE_ENV=production

Tool Configuration

Each tool accepts an options object:

{
  "timeout": 30000,          // Request timeout (ms)
  "maxConcurrency": 5,       // Concurrent requests (1-20)
  "maxResults": 10,          // Limit results (1-50)
  "respectRobots": false,    // Respect robots.txt
  "sameOriginOnly": true     // Only same-origin URLs
}
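
In TypeScript terms, those options correspond roughly to the interface below. This is an illustrative shape mirroring the comments above, not the actual definition in src/types.ts:

// Illustrative options shape; all fields optional, ranges per the comments above.
interface CrawlOptions {
  timeout?: number;         // request timeout in ms
  maxConcurrency?: number;  // concurrent requests (1-20)
  maxResults?: number;      // limit on returned results (1-50)
  respectRobots?: boolean;  // honor robots.txt
  sameOriginOnly?: boolean; // restrict crawling to same-origin URLs
}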

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make changes and add tests
  4. Run tests: npm run test:ci
  5. Commit: git commit -m 'Add amazing feature'
  6. Push: git push origin feature/amazing-feature
  7. Open a Pull Request

Development Guidelines

  • Follow TypeScript strict mode
  • Add tests for new features
  • Update documentation
  • Run linting: npm run lint
  • Ensure CI passes

📄 License

MIT License - see LICENSE file

🙏 Acknowledgments

📞 Support

🚀 What's Next?

  • Add DuckDuckGo search support
  • Implement content filtering
  • Add screenshot capabilities
  • Support for authenticated content
  • PDF extraction
  • Real-time monitoring

Built with ❤️ using TypeScript, MCP, and modern web technologies.
