
cortex-scout

An advanced web extraction and meta-search engine for AI agents. It features native parallel searching, Human-in-the-Loop (HITL) authentication fallback, and LLM-optimized data synthesis for deep web research.

Stars: 53 · Forks: 4 · Updated: Feb 26, 2026 · Validated: Feb 27, 2026

CortexScout (cortex-scout) — Search and Web Extraction Engine for AI Agents

CortexScout is the Deep Research & Web Extraction module within the Cortex-Works ecosystem.

It is designed for agent workloads that require token-efficient web retrieval, reliable anti-bot handling, and optional Human-in-the-Loop (HITL) fallback.

MIT License · Built with Rust · MCP · CI


Overview

CortexScout provides a single, self-hostable Rust binary that exposes search and extraction capabilities over MCP (stdio) and an optional HTTP server. Output formats are structured and optimized for downstream LLM use.

It is built to handle the practical failure modes of web retrieval (rate limits, bot challenges, JavaScript-heavy pages) through progressive fallbacks: native retrieval → Chromium CDP rendering → HITL workflows.


Tools (Capability Roster)

| Area | MCP Tools / Capabilities |
| --- | --- |
| Search | web_search, web_search_json (parallel meta-search + dedup/scoring) |
| Fetch | web_fetch, web_fetch_batch (token-efficient clean output, optional semantic filtering) |
| Crawl | web_crawl (bounded discovery for doc sites / sub-pages) |
| Extraction | extract_fields, fetch_then_extract (schema-driven extraction) |
| Anti-bot handling | CDP rendering, proxy rotation, block-aware retries |
| HITL | visual_scout (screenshot for gate confirmation), human_auth_session (authenticated fetch with persisted sessions), non_robot_search (last-resort rendering) |
| Memory | memory_search (LanceDB-backed research history) |
| Deep research | deep_research (multi-hop search + scrape + synthesis via OpenAI-compatible APIs) |
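As a sketch, a client-side call to web_search_json over MCP might carry a payload like the following. Only the tool name comes from the roster above; the argument names (query, max_results) and the wrapper shape are illustrative, not confirmed against the server's schema:

```json
{
  "tool": "web_search_json",
  "arguments": {
    "query": "rust async runtime comparison",
    "max_results": 10
  }
}
```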

Ecosystem Integration

While CortexScout runs as a standalone tool today, it is designed to integrate with CortexDB and CortexStudio for multi-agent scaling, shared retrieval artifacts, and centralized governance.


Anti-Bot Efficacy & Validation

This repository includes captured evidence artifacts that validate extraction and HITL flows against representative protected targets.

| Target | Protection | Evidence | Notes |
| --- | --- | --- | --- |
| LinkedIn | Cloudflare + Auth | JSON · Snippet | Auth-gated listings extraction |
| Ticketmaster | Cloudflare Turnstile | JSON · Snippet | Challenge-handled extraction |
| Airbnb | DataDome | JSON · Snippet | Large result sets under bot controls |
| Upwork | reCAPTCHA | JSON · Snippet | Protected listings retrieval |
| Amazon | AWS Shield | JSON · Snippet | Search result extraction |
| nowsecure.nl | Cloudflare | JSON | Manual return path validated |

See proof/README.md for methodology and raw outputs.


Quick Start

Option A — Prebuilt binaries

Download the latest release assets from GitHub Releases and run one of:

  • cortex-scout-mcp — MCP stdio server (recommended for VS Code / Cursor / Claude Desktop)
  • cortex-scout — optional HTTP server (default port 5000; override via --port, PORT, or CORTEX_SCOUT_PORT)

Health check (HTTP server):

./cortex-scout --port 5000
curl http://localhost:5000/health

Option B — Build from source

git clone https://github.com/cortex-works/cortex-scout.git
cd cortex-scout

cd mcp-server
cargo build --release --all-features

MCP Integration (VS Code / Cursor / Claude Desktop)

Add a server entry to your MCP config. Example for VS Code (stdio transport):

{
  "servers": {
    "cortex-scout": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=info",
        "SEARCH_ENGINES=google,bing,duckduckgo,brave",
        "LANCEDB_URI=/YOUR_PATH/cortex-scout/lancedb",
        "HTTP_TIMEOUT_SECS=30",
        "MAX_CONTENT_CHARS=10000",
        "IP_LIST_PATH=/YOUR_PATH/cortex-scout/ip.txt",
        "PROXY_SOURCE_PATH=/YOUR_PATH/cortex-scout/proxy_source.json",
        "/YOUR_PATH/cortex-scout/mcp-server/target/release/cortex-scout-mcp"
      ]
    }
  }
}
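Claude Desktop reads a slightly different config shape: a top-level mcpServers key in claude_desktop_config.json, with environment variables passed as an env object rather than via an env-wrapper command. A minimal sketch, assuming the same build path as in the VS Code example above:

```json
{
  "mcpServers": {
    "cortex-scout": {
      "command": "/YOUR_PATH/cortex-scout/mcp-server/target/release/cortex-scout-mcp",
      "env": {
        "RUST_LOG": "info",
        "SEARCH_ENGINES": "google,bing,duckduckgo,brave"
      }
    }
  }
}
```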

Multi-IDE guide: docs/IDE_SETUP.md


Configuration (cortex-scout.json)

Create cortex-scout.json in the same directory as the binary (or repository root). All fields are optional; environment variables act as fallback.

{
  "deep_research": {
    "enabled": true,
    "llm_base_url": "http://localhost:1234/v1",
    "llm_api_key": "",
    "llm_model": "lfm2-2.6b",
    "synthesis_enabled": true,
    "synthesis_max_sources": 3,
    "synthesis_max_chars_per_source": 800,
    "synthesis_max_tokens": 1024
  }
}
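If you prefer not to ship a cortex-scout.json, the same deep-research settings can be supplied via environment variables. The field-to-variable mapping below is inferred from the Key Environment Variables table and should be verified against your build:

```shell
# Sketch: environment-variable fallback mirroring the JSON example above.
export DEEP_RESEARCH_ENABLED=1
export OPENAI_BASE_URL="http://localhost:1234/v1"
export OPENAI_API_KEY=""   # key-less local endpoint (LM Studio / Ollama)
export DEEP_RESEARCH_LLM_MODEL="lfm2-2.6b"
export DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS=1024
```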

Key Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| CHROME_EXECUTABLE | auto-detected | Override path to Chromium/Chrome/Brave |
| SEARCH_ENGINES | google,bing,duckduckgo,brave | Active engines (comma-separated) |
| SEARCH_MAX_RESULTS_PER_ENGINE | 10 | Results per engine before merge |
| SEARCH_CDP_FALLBACK | auto | Retry blocked retrieval via native Chromium CDP |
| LANCEDB_URI | (unset) | Path for semantic memory (optional) |
| CORTEX_SCOUT_MEMORY_DISABLED | 0 | Set 1 to disable memory features |
| HTTP_TIMEOUT_SECS | 30 | Per-request timeout |
| OUTBOUND_LIMIT | 32 | Max concurrent outbound connections |
| MAX_CONTENT_CHARS | 10000 | Max chars per scraped document |
| IP_LIST_PATH | (unset) | Proxy IP list path |
| PROXY_SOURCE_PATH | (unset) | Proxy source definition path |
| DEEP_RESEARCH_ENABLED | 1 | Disable the deep_research tool at runtime by setting 0 |
| OPENAI_API_KEY | (unset) | API key for synthesis (omit for key-less local endpoints) |
| OPENAI_BASE_URL | https://api.openai.com/v1 | OpenAI-compatible endpoint (Ollama/LM Studio supported) |
| DEEP_RESEARCH_LLM_MODEL | gpt-4o-mini | Model name (OpenAI-compatible) |
| DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS | 1024 | Response token budget for synthesis |
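As a usage sketch, these variables can be combined to tune a single run. For example, narrowing the engine set and tightening per-document limits before launching the HTTP server (variable names and the --port flag are taken from the documentation above):

```shell
# Restrict meta-search to two engines and cap scraped content per document.
export SEARCH_ENGINES=duckduckgo,brave
export SEARCH_MAX_RESULTS_PER_ENGINE=5
export MAX_CONTENT_CHARS=5000

# Then launch either entry point, e.g. the HTTP server:
#   ./cortex-scout --port 5000
```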

Agent Best Practices

Recommended operational flow:

  1. Use memory_search before new research runs to avoid re-fetching.
  2. Prefer web_search_json for initial discovery (search + content summaries).
  3. Use web_fetch for known URLs; use output_format="clean_json" and set query + strict_relevance=true for token efficiency.
  4. On 403/429/rate-limit: call proxy_control with action:"grab", then retry with use_proxy:true.
  5. For auth walls: visual_scout to confirm gating, then human_auth_session to complete login and persist sessions under ~/.cortex-scout/sessions/.
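The rate-limit recovery in step 4 can be sketched as a pair of MCP tool calls. The tool names and argument values (action:"grab", use_proxy:true, output_format:"clean_json", strict_relevance:true) come from the steps above; the surrounding payload shape and the URL are illustrative:

```json
[
  { "tool": "proxy_control", "arguments": { "action": "grab" } },
  {
    "tool": "web_fetch",
    "arguments": {
      "url": "https://example.com/blocked-page",
      "use_proxy": true,
      "output_format": "clean_json",
      "strict_relevance": true
    }
  }
]
```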

Full agent rules: /.github/copilot-instructions.md


Versioning and Changelog

See CHANGELOG.md.


License

MIT. See LICENSE.
