MCP Hub
Back to servers

markdown-for-agents-mcp

MCP server for AI agents -- fetch any URL with full JavaScript rendering (Playwright/Chromium) and convert to clean, token-efficient markdown. Works on React, Vue, Angular, and any JS-heavy page. Includes web search, batch fetching, binary file download, LRU cache, SSRF protection, and structured output.

glama
Updated
Apr 6, 2026

markdown-for-agents-mcp

npm version npm downloads Node.js codecov License: MIT

An MCP (Model Context Protocol) server that fetches URLs with full JavaScript rendering and converts them to clean, token-efficient markdown for AI agents.

Most MCP fetch tools use plain HTTP — they see what a server sends without running any JavaScript. That works for static sites, but silently returns empty or broken content for React, Vue, Angular, SPAs, and any page that loads data dynamically. This server runs a real Chromium browser via Playwright, so it renders the full page before extraction — the same content a human user would see.

Powered by Playwright and the markdown-for-agents library. Strips ads, navigation, and boilerplate — delivering up to 80% fewer tokens than raw HTML.


Why Playwright?

CapabilityPlain HTTP fetchersmarkdown-for-agents-mcp
Static HTML pages
React / Vue / Angular apps
JavaScript-rendered content
Single-page app routes
Lazy-loaded / infinite-scroll
Token efficiency vs raw HTMLMediumUp to 80% fewer
Bot-detection evasionNoneUA rotation, webdriver spoofing

Token reduction example: a typical news article page is ~150 KB of raw HTML (~40,000 tokens). After Playwright rendering, DOM pruning, and markdown conversion the same article becomes ~2,000 tokens — a 95% reduction.


Table of Contents


Features

  • JavaScript Rendering — Playwright-driven Chromium renders React, Vue, Angular, and any JS-heavy page before extraction
  • Structured Output — Tools return typed structuredContent (url, title, markdown, fetchedAt, contentSize) alongside the text response, compatible with MCP SDK 1.11+
  • Smart Content Extraction — Scores and selects the main content block (main > article > #content > body), dropping sidebars, nav, and ads automatically
  • Token Efficiency — Produces compact LLM-ready markdown; benchmarks show up to 80% fewer tokens than raw HTML
  • Web Search — DuckDuckGo search with optional fetch-and-convert of top results
  • LRU Cache — 50 MB in-memory cache with a 15-minute TTL avoids redundant fetches
  • Domain Filtering — Built-in blocklist of trackers/social domains; supports per-request allow/block lists and server-level allowlist mode
  • Batch Fetching — Concurrent multi-URL fetches with configurable parallelism
  • HTTP Server Mode — Run as an HTTP server (--http [port] or HTTP_PORT env var) with optional bearer token auth
  • Proxy Support — Pass PLAYWRIGHT_PROXY to route Playwright traffic through a proxy
  • Health Monitoringhealth_check tool exposes cache and fetch metrics
  • Zero Configuration — Chromium is installed automatically on first run

Installation

npm install -g markdown-for-agents-mcp

Chromium is downloaded automatically via the postinstall script. If that fails, see Troubleshooting.

You can also run without installing globally using npx:

npx markdown-for-agents-mcp

MCP Client Setup

Add the server to your MCP client configuration.

Claude Desktop

Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "markdown": {
      "command": "markdown-mcp"
    }
  }
}

VS Code (Copilot / Continue)

Add to your workspace or user settings.json under the relevant MCP extension key, for example:

{
  "mcpServers": {
    "markdown": {
      "command": "markdown-mcp"
    }
  }
}

Cursor / Windsurf / Zed

Any client that implements the MCP specification can use this server. The command entry point is markdown-mcp (available on PATH after global install) or the full path to dist/index.js for local builds.

With environment variable overrides

{
  "mcpServers": {
    "markdown": {
      "command": "markdown-mcp",
      "env": {
        "FETCH_TIMEOUT_MS": "60000",
        "LOG_LEVEL": "DEBUG"
      }
    }
  }
}

HTTP server mode

Instead of stdio, you can run the server as a standard HTTP endpoint — useful for shared deployments, Docker, or any client that prefers the Streamable HTTP transport:

# Start on port 3456
markdown-mcp --http 3456

# Or use the env var
HTTP_PORT=3456 markdown-mcp

All MCP traffic is handled at POST|GET|DELETE /mcp. To require a bearer token, set MCP_AUTH_TOKEN:

MCP_AUTH_TOKEN=mysecrettoken HTTP_PORT=3456 markdown-mcp

Clients must then pass Authorization: Bearer mysecrettoken with every request.


Available Tools

fetch_url

Fetches a single URL with full JavaScript rendering and returns clean markdown.

Arguments:

NameTypeRequiredDescription
urlstringyesURL to fetch and convert
timeoutnumbernoRequest timeout in ms (overrides FETCH_TIMEOUT_MS)

Example:

fetch_url(url="https://example.com/blog/post")

Text output (always present, backward-compatible):

# Blog Post Title

Source: https://example.com/blog/post

This is the main content of the article, stripped of navigation, ads, and boilerplate.

## Related Section

More content here...

---
*Converted by markdown-for-agents-mcp*

Structured output (available to MCP SDK 1.11+ clients via structuredContent):

{
  "url": "https://example.com/blog/post",
  "title": "Blog Post Title",
  "markdown": "# Blog Post Title\n\nSource: ...",
  "fetchedAt": "2026-04-06T17:00:00.000Z",
  "contentSize": 2048
}

fetch_urls

Fetches multiple URLs concurrently and returns combined markdown, one section per URL.

Arguments:

NameTypeRequiredDescription
urlsstring[]yesURLs to fetch
timeoutnumbernoPer-request timeout in ms

Example:

fetch_urls(urls=[
  "https://example.com/post1",
  "https://example.com/post2"
])

Text output:

# Post 1 Title

Source: https://example.com/post1

...

---

# Post 2 Title

Source: https://example.com/post2

...

---

Structured output (via structuredContent):

{
  "results": [
    {
      "url": "https://example.com/post1",
      "title": "Post 1 Title",
      "markdown": "...",
      "fetchedAt": "2026-04-06T17:00:00.000Z",
      "contentSize": 1820,
      "success": true
    },
    {
      "url": "https://example.com/post2",
      "title": "Post 2 Title",
      "markdown": "...",
      "fetchedAt": "2026-04-06T17:00:00.000Z",
      "contentSize": 2104,
      "success": true
    }
  ],
  "summary": { "total": 2, "succeeded": 2, "failed": 0 }
}

Parallelism is controlled by MAX_CONCURRENT_FETCHES (default: 5).


web_search

Searches DuckDuckGo and optionally fetches top results as markdown. Uses a plain HTTP endpoint to avoid bot detection — no Playwright for the search itself.

Arguments:

NameTypeRequiredDescription
querystringyesSearch query
maxResultsnumbernoMax results to return (default: 10)
allowedDomainsstring[]noOnly include results from these domains
blockedDomainsstring[]noExclude results from these domains
fetchResultsbooleannoFetch and convert top result pages to markdown
timeoutnumbernoRequest timeout in ms

Example — search only:

web_search(
  query="typescript tutorials",
  maxResults=5,
  allowedDomains=["typescriptlang.org", "github.com"]
)

Example — search and fetch:

web_search(
  query="react hooks guide",
  fetchResults=true,
  maxResults=3
)

Text output:

# Web Search Results

## Query: typescript tutorials
**Found 5 results in 1234ms**

### Results:

1. [TypeScript Handbook](https://www.typescriptlang.org/docs/)
   The TypeScript Handbook provides comprehensive documentation...

2. [Best TypeScript Tutorials](https://github.com/danistefanovic/build-your-own-typescript)
   Learn TypeScript by building your own compiler...

Structured output (via structuredContent):

{
  "query": "typescript tutorials",
  "results": [
    { "title": "TypeScript Handbook", "url": "https://www.typescriptlang.org/docs/", "snippet": "...", "domain": "typescriptlang.org" }
  ],
  "fetchedContent": [
    { "url": "https://www.typescriptlang.org/docs/", "markdown": "..." }
  ],
  "durationMs": 1234
}

Note: allowedDomains and blockedDomains arguments apply to search result filtering only. Server-level BLOCKLIST_DOMAINS / USE_ALLOWLIST_MODE settings still apply when those results are subsequently fetched.


download_file

Downloads a binary file (PDF, image, ZIP, etc.) from a URL and saves it to a local path. Uses a plain HTTP client — no Playwright required. SSRF protection and domain block list are enforced.

Arguments:

NameTypeRequiredDescription
urlstringyesURL of the file to download
outputPathstringyesAbsolute local path to save the file to (parent directory must exist)

Example:

download_file(
  url="https://example.com/report.pdf",
  outputPath="/tmp/report.pdf"
)

Output:

{
  "savedPath": "/tmp/report.pdf",
  "sizeBytes": 204800,
  "mimeType": "application/pdf",
  "filename": "report.pdf"
}

Note: URLs with paths like /download/... are permitted for this tool even though they are blocked by fetch_url (to avoid binary download chains). Use fetch_url for HTML pages — download_file will reject text/html responses.


health_check

Returns current server status, cache metrics, and fetch statistics. Useful for monitoring and debugging.

Arguments: none

Example output:

{
  "status": "healthy",
  "cache": {
    "hits": 47,
    "misses": 15,
    "currentSize": 12,
    "totalBytes": 4194304,
    "maxBytes": 52428800
  },
  "metrics": {
    "totalFetches": 62,
    "successCount": 59,
    "errorCount": 3,
    "avgDuration": 1840,
    "cacheUtilization": 76
  }
}

CLI Usage

A standalone CLI (markdown-cli) is included for use outside the MCP protocol.

Single URL

markdown-cli https://example.com

Multiple URLs (batch mode)

markdown-cli -b https://example.com https://example.org https://example.net

Save to file

markdown-cli https://example.com/article > article.md

Download a binary file

markdown-cli -d -o /tmp/report.pdf https://example.com/report.pdf

Command reference

CommandDescription
markdown-cli <url>Fetch a single URL and print markdown
markdown-cli -b <url1> <url2> ...Fetch multiple URLs in batch mode
markdown-cli -d -o <path> <url>Download a binary file to a local path
markdown-cli --helpShow help

Configuration

All settings are read from environment variables at startup and validated with Zod. Invalid values cause a non-zero exit with a descriptive error.

Copy .env.example to .env to get started:

cp .env.example .env

Reference

VariableDefaultDescription
FETCH_TIMEOUT_MS30000Timeout per fetch request (ms)
MAX_CONCURRENT_FETCHES5Max parallel fetches in batch operations
MAX_REDIRECTS10Max redirect hops before error
MAX_CONTENT_LENGTH100000Max content size (chars) before truncation
LOG_LEVELINFODEBUG, INFO, WARN, or ERROR
LOG_FORMATtexttext (human-readable) or json (structured)
CACHE_MAX_BYTES52428800Max LRU cache size (50 MB)
CACHE_TTL_MS900000Cache entry TTL (15 minutes)
USE_ALLOWLIST_MODEfalseWhen true, only domains in BLOCKLIST_DOMAINS are allowed
BLOCKLIST_DOMAINS(empty)Comma-separated domains to block (or allow in allowlist mode)
BLOCKLIST_URL_PATTERNS(empty)Comma-separated regex patterns to block by URL path
WEB_SEARCH_DEFAULT_TIMEOUT_MS30000Default timeout for search requests (ms)
DOWNLOAD_TIMEOUT_MS60000Timeout for binary file downloads (ms)
HTTP_PORT(unset)When set, starts an HTTP server on this port instead of stdio
MCP_AUTH_TOKEN(unset)Bearer token required on all HTTP requests (HTTP mode only)
PLAYWRIGHT_PROXY(unset)Proxy server URL for Playwright (e.g. http://proxy.example.com:8080)
PLAYWRIGHT_PROXY_BYPASS(unset)Comma-separated domains to bypass the proxy

All logs are written to stderr to keep stdout clean for the MCP protocol.


Security

Default domain blocklist

The following domains are blocked by default to prevent accidental fetches of trackers, ad networks, and social platforms that aggressively block bots or serve low-quality content:

doubleclick.net, facebook.com, twitter.com, tiktok.com, hotjar.com, mixpanel.com, bit.ly, and approximately 20 others (see src/utils/domainBlacklist.ts for the full list).

If you need to fetch a blocked domain, add it to BLOCKLIST_DOMAINS with USE_ALLOWLIST_MODE=false — this adds to the block list and does not remove existing entries. To allow a default-blocked domain you will need to fork and modify domainBlacklist.ts.

URL path blocking

Certain URL path patterns are blocked regardless of domain (e.g. OAuth callbacks, binary file downloads, payment/checkout paths, admin panels). These protect against accidental fetches of sensitive or non-content URLs.

Allowlist mode

Set USE_ALLOWLIST_MODE=true and BLOCKLIST_DOMAINS=yourdomain.com,trusted.org to restrict the server to only fetching from explicitly listed domains. Recommended for production deployments.

Redirect policy

Cross-origin redirects are blocked. The server only follows same-origin redirect chains (up to MAX_REDIRECTS hops).

For the full security model and reporting vulnerabilities, see SECURITY.md.


Architecture

graph TD
    subgraph MCP["MCP Server Layer"]
        entry["index.ts\n(entry)"]
        fetchUrl["fetchUrl\n(tool)"]
        fetchUrls["fetchUrls\n(tool)"]
        webSearchTool["webSearch\n(tool)"]
        healthCheck["health_check\n(tool)"]
    end

    subgraph Services["Service Layer"]
        fetcher["fetcher\n(Playwright)"]
        converter["converter\n(HTML → MD)"]
        webSearchSvc["webSearch\n(DuckDuckGo)"]
        config["config\n(Zod)"]
    end

    subgraph Utils["Utilities"]
        cache["cache.ts"]
        blocklist["domainBlacklist.ts"]
        errors["errors.ts"]
        logger["logger.ts"]
    end

    entry --> fetchUrl
    entry --> fetchUrls
    entry --> webSearchTool
    entry --> healthCheck

    fetchUrl --> fetcher
    fetchUrl --> converter
    fetchUrls --> fetcher
    fetchUrls --> converter
    webSearchTool --> webSearchSvc
    webSearchSvc --> fetcher

    fetcher --> cache
    fetcher --> blocklist
    fetcher --> logger
    fetcher --> errors
    webSearchSvc --> blocklist
    converter --> config
    fetcher --> config

Key components

ComponentFileResponsibility
MCP entry pointsrc/index.tsTool registration, dispatch
Playwright fetchersrc/fetcher.tsJS rendering, DOM pruning, LRU cache
HTML convertersrc/converter.tsWraps markdown-for-agents with content scoring
DuckDuckGo searchsrc/services/webSearch.tsPlain HTTP search, result parsing
Configsrc/config.tsZod-validated env var schema
Cachesrc/utils/cache.tsLRU eviction with byte-level size tracking
Domain filtersrc/utils/domainBlacklist.tsBlock/allowlist logic
Loggersrc/utils/logger.tsStructured logging, per-domain metrics

Fetcher internals: a single persistent Chromium instance is shared across requests. Each fetch opens a fresh page (isolated state), applies DOM pruning to strip non-content elements, then extracts HTML using a priority selector chain. Browser fingerprint spoofing (navigator.webdriver = false, randomized UA) reduces bot-detection rejections.

Content conversion: the markdown-for-agents library uses content scoring and boilerplate detection (extract: true) to identify the main article body before converting to markdown with inline links (linkStyle: "inlined").


Development

Prerequisites

  • Node.js >= 20.0.0
  • npm

Setup

git clone https://github.com/JohnnyFoulds/markdown-for-agents-mcp.git
cd markdown-for-agents-mcp
npm install        # also installs Chromium via postinstall

Scripts

npm run build      # Compile TypeScript → dist/
npm run dev        # Watch mode
npm run typecheck  # Type-check without emitting
npm test           # Run Vitest suite

Running locally

npm run build
node dist/index.js

Debug logging

LOG_LEVEL=DEBUG node dist/index.js
LOG_LEVEL=DEBUG LOG_FORMAT=json node dist/index.js

Troubleshooting

Playwright fails to install Chromium

npx playwright install chromium

On Linux, also install OS-level dependencies:

npx playwright install-deps chromium

MCP connection issues

Capture server logs:

markdown-mcp 2>&1 | tee mcp.log

Domain blocked errors

By default, tracker, ad-network, and social media domains are blocked. Check src/utils/domainBlacklist.ts to see the full list. To add a domain to the blocklist (not remove from the default list), set BLOCKLIST_DOMAINS=yourdomain.com.

Build errors

rm -rf node_modules dist
npm install
npm run build

Contributing

Contributions are welcome. Please read CONTRIBUTING.md for full guidelines. The short version:

  1. Branch from development
  2. Follow Conventional Commits (type(scope): subject)
  3. Add or update tests — aim for >80% coverage
  4. Open a pull request against development

Changelog

See CHANGELOG.md for release history.


License

MIT

Reviews

No reviews yet

Sign in to write a review