
MolmoWeb MCP Server

Enables autonomous web automation through Playwright browser control integrated with MolmoWeb's vision model for pixel-level action prediction. Supports end-to-end task execution via an orchestrator LLM that decomposes natural language instructions into browser actions.

Updated Apr 5, 2026

molmoweb-mcp

MCP server that exposes MolmoWeb web automation as tools for Claude (or any MCP client). Uses Playwright for browser control.

Architecture

Claude / MCP Client
  ↓ stdio (MCP protocol)
molmoweb-mcp (this server)
  ↓                    ↓
Playwright browser   MolmoWeb API (localhost:8001)

Tools

Tool                      Description
molmoweb_check_status     Health check for the MolmoWeb backend
browser_navigate          Open a URL in the Playwright browser
browser_screenshot        Capture a JPEG screenshot (returns a base64 image)
browser_get_page_info     Get the current URL and page title
browser_execute_action    Execute click/type/scroll/press_key/hover/navigate/wait
molmoweb_predict          Ask the MolmoWeb vision model what action to perform
run_web_task              Full autonomous agent loop (orchestrator + MolmoWeb + execution)
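As an illustration of how the action tool might be driven, the sketch below builds input objects for browser_execute_action. The field names (type, x, y, text) are assumptions for illustration, not the server's published schema; only the action types come from the table above.

```javascript
// Hypothetical input builder for browser_execute_action.
// Field names are illustrative assumptions, not the server's actual schema;
// the action types are the ones listed in the tools table.
function makeAction(type, params = {}) {
  const allowed = ["click", "type", "scroll", "press_key", "hover", "navigate", "wait"];
  if (!allowed.includes(type)) {
    throw new Error(`Unsupported action type: ${type}`);
  }
  return { type, ...params };
}

// Example: click at pixel coordinates predicted by the vision model.
const click = makeAction("click", { x: 412, y: 318 });

// Example: type text into the currently focused element.
const typing = makeAction("type", { text: "AI news" });
```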

Setup

npm install
npx playwright install chromium

Start the MolmoWeb backend

The MolmoWeb vision model must be running at http://127.0.0.1:8001. On Windows with WSL:

# Using the provided script:
run_molmoweb.bat
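To verify the backend is reachable before wiring up the MCP client, a check along these lines can be used. The "/health" endpoint path is an assumption for illustration; adjust it to whatever route your MolmoWeb backend actually serves.

```javascript
// Minimal reachability check for the MolmoWeb backend.
// The "/health" path is an assumption; substitute your backend's real route.
const MOLMOWEB_BASE_URL = "http://127.0.0.1:8001";

function healthUrl(base = MOLMOWEB_BASE_URL) {
  return new URL("/health", base).toString();
}

async function checkMolmoWebStatus() {
  try {
    const res = await fetch(healthUrl());
    return res.ok; // true when the backend answers with a 2xx status
  } catch {
    return false; // backend not running or not reachable
  }
}
```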

Configure in Claude Code

Add to your ~/.mcp.json (global) or project .mcp.json:

{
  "mcpServers": {
    "molmoweb": {
      "command": "node",
      "args": ["/path/to/molmoweb-mcp/server.js"]
    }
  }
}

Run standalone

npm start

Orchestrator LLM Support

The run_web_task tool uses an LLM orchestrator to decompose tasks into step-by-step browser actions. Supported providers:

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4
  • Anthropic: claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5-20251001
  • Custom: Any OpenAI-compatible endpoint (e.g., Ollama)
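Because all three providers can be reached through the OpenAI-compatible chat-completions format, an orchestrator request can be sketched as below. The system prompt wording is illustrative, not the server's actual prompt.

```javascript
// Sketch: build an OpenAI-compatible chat-completions payload for the
// orchestrator. The system prompt text is an illustrative assumption.
function buildOrchestratorRequest(task, model = "gpt-4o-mini") {
  return {
    model,
    messages: [
      {
        role: "system",
        content: "Decompose the user's web task into one atomic browser instruction per step.",
      },
      { role: "user", content: task },
    ],
  };
}
```

For a custom provider such as Ollama, the same payload is POSTed to that provider's OpenAI-compatible chat-completions endpoint instead of OpenAI's.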

How It Works

  1. User provides a high-level task (e.g., "Search Google for AI news")
  2. The orchestrator LLM decomposes it into atomic browser instructions
  3. MolmoWeb vision model translates each instruction into pixel-level actions
  4. Playwright executes the actions in a visible Chromium browser
  5. Loop repeats until the task is complete or max steps reached
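The loop above can be sketched as follows, with the orchestrator, vision model, and Playwright executor injected as functions so the control flow is clear. The function names and the convention that the planner returns null when the task is complete are assumptions, not the server's internal API.

```javascript
// Sketch of the run_web_task loop. plan/predict/execute stand in for the
// orchestrator LLM, the MolmoWeb vision model, and Playwright; their names
// and the null-means-done convention are illustrative assumptions.
async function runWebTask(task, { plan, predict, execute }, maxSteps = 10) {
  const history = [];
  for (let step = 0; step < maxSteps; step++) {
    // Steps 1-2: the orchestrator proposes the next atomic instruction,
    // or signals completion.
    const instruction = await plan(task, history);
    if (instruction === null) return { done: true, steps: history.length };
    // Step 3: the vision model grounds the instruction to a pixel-level action.
    const action = await predict(instruction);
    // Step 4: Playwright executes the action in the browser.
    await execute(action);
    history.push({ instruction, action });
  }
  // Step 5: max steps reached without the orchestrator declaring success.
  return { done: false, steps: history.length };
}
```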

License

MIT
