
Web Scraper

A robust web scraping server that utilizes headless Chrome to extract content from modern JavaScript-heavy websites and single-page applications, optimized for LLM processing by converting to Markdown or clean HTML.

Stars
2
Updated
Jul 20, 2025
Validated
Jan 11, 2026

mcp-web-scraper

This package uses Google Chrome's headless APIs to scrape web pages for AI/LLM agents.

Because it uses Chrome as its default user agent, sites that require JavaScript (for example, single-page applications) should also be parsable with this tool.

It can be called either from Go via LangChainGo, or run as an MCP server.

MCP Server

First, compile the code with Go:

go build .

Claude Desktop

{
  "mcpServers": {
    "mcp-web-scraper": {
      "command": "/path/to/mcp-web-scraper",
      "args": []
    }
  }
}

Visual Studio Code

{
  "mcp": {
    "servers": {
      "mcp-web-scraper": {
        "command": "/path/to/mcp-web-scraper",
        "args": []
      }
    }
  }
}

LangChainGo tool

Integration into LangChainGo is straightforward:

import "github.com/lmorg/mcp-web-scraper/langchain"

func example() {
    scraper := langchain.NewScraper()
}

Please consult the LangChainGo docs for how to use tools with their libraries.

Fallback Modes

If Google Chrome is not installed

If you do not have Google Chrome installed, then mcp-web-scraper will fall back to Go's standard HTTP client.

This will work in the majority of cases; however, you might not get any content from sites that require JavaScript to render.
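The fallback decision can be sketched as a simple PATH lookup. This is only an illustration of the idea described above: the binary name and the probing strategy are assumptions, and the real package may detect Chrome differently.

```go
package main

import (
	"fmt"
	"os/exec"
)

// chooseFetcher reports which backend a scraper of this style would
// pick: headless Chrome when the browser binary is on PATH, otherwise
// Go's built-in HTTP client. The binary name passed in is an assumption.
func chooseFetcher(browser string) string {
	if _, err := exec.LookPath(browser); err != nil {
		return "net/http"
	}
	return "headless-chrome"
}

func main() {
	fmt.Println(chooseFetcher("google-chrome"))
}
```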

Markdown Support

By default, this module will look for an <article> element and convert it to Markdown.

If the page doesn't present itself as an article of some description (eg not a blog post, technical documentation, etc), then this module will fall back to returning HTML.

Any HTML document returned will have specific HTML tags removed (such as <script>, <svg>, and HTML comments) to reduce the tokens required for the LLM to parse.
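A minimal sketch of that kind of tag stripping is shown below, using only the standard library. Note this regex approach is an assumption for illustration; a robust implementation (likely what this package does) would use a proper HTML parser rather than regular expressions.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns for the kinds of tags described above: <script> blocks,
// inline <svg> graphics, and HTML comments. Non-greedy, case-insensitive,
// and with "." matching newlines so multi-line blocks are caught.
var noise = []*regexp.Regexp{
	regexp.MustCompile(`(?is)<script\b.*?</script>`),
	regexp.MustCompile(`(?is)<svg\b.*?</svg>`),
	regexp.MustCompile(`(?s)<!--.*?-->`),
}

// stripNoise removes the noisy tags so fewer tokens reach the LLM.
func stripNoise(html string) string {
	for _, re := range noise {
		html = re.ReplaceAllString(html, "")
	}
	return html
}

func main() {
	in := `<p>hello</p><script>alert(1)</script><!-- tracking --><svg><path/></svg>`
	fmt.Println(stripNoise(in))
}
```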
