
MetriLLM


Benchmark your local LLM models in one command. Speed, quality, hardware fitness — with a shareable score and public leaderboard.

Think Geekbench, but for local LLMs on your actual hardware.

npx metrillm@latest bench

Screenshots: MetriLLM CLI interactive menu and hardware detection.

MetriLLM Leaderboard

What You Get

  • Performance metrics: tokens/sec, time to first token, memory usage, load time
  • Quality evaluation: reasoning, coding, math, instruction following, structured output, multilingual (14 prompts, 6 categories)
  • Global score (0-100): 30% hardware fit + 70% quality
  • Verdict: EXCELLENT / GOOD / MARGINAL / NOT RECOMMENDED
  • One-click share: --share uploads your result and gives you a public URL + leaderboard rank

Real Benchmark Results

From the public leaderboard — all results below were submitted with metrillm bench --share.

| Model | Machine | CPU | RAM | tok/s | TTFT | Global | Verdict |
|---|---|---|---|---|---|---|---|
| llama3.2:latest | Mac Mini | Apple M4 Pro | 64 GB | 98.9 | 125 ms | 77 | GOOD |
| mistral:latest | Mac Mini | Apple M4 Pro | 64 GB | 54.3 | 124 ms | 76 | GOOD |
| gemma3:4b | MacBook Air | Apple M4 | 32 GB | 35.9 | 303 ms | 72 | GOOD |
| gemma3:1b | MacBook Air | Apple M4 | 32 GB | 39.4 | 362 ms | 72 | GOOD |
| qwen3:1.7b | MacBook Air | Apple M4 | 32 GB | 37.9 | 3.1 s | 70 | GOOD |
| llama3.2:3b | MacBook Air | Apple M4 | 32 GB | 27.8 | 285 ms | 69 | GOOD |
| gemma3:12b | MacBook Air | Apple M4 | 32 GB | 12.3 | 656 ms | 67 | GOOD |
| phi4:14b | MacBook Air | Apple M4 | 32 GB | 11.1 | 515 ms | 65 | GOOD |
| mistral:7b | MacBook Air | Apple M4 | 32 GB | 13.6 | 517 ms | 61 | GOOD |
| deepseek-r1:14b | MacBook Air | Apple M4 | 32 GB | 10.8 | 30.0 s | 25 | NOT RECOMMENDED |

Key takeaway: Small models (1-4B) fly on Apple Silicon. Larger models (14B+) with thinking chains can choke even on capable hardware. See full leaderboard →

Install

Requires Node 20+ and a local runtime: Ollama or LM Studio.

# Run directly (no install)
npx metrillm@latest bench

# Or install globally
npm i -g metrillm
metrillm bench

# Homebrew (no global npm install)
# One-liner install (without pre-tapping):
brew install MetriLLM/metrillm/metrillm

# Or one-time tap for short install command:
brew tap MetriLLM/metrillm
# Then:
brew install metrillm
metrillm bench

# Alternative package managers
pnpm dlx metrillm@latest bench
bunx metrillm@latest bench

Usage

# Interactive mode — pick models from a menu
metrillm bench

# Benchmark a specific model
metrillm bench --model gemma3:4b

# Benchmark with LM Studio backend
metrillm bench --backend lm-studio --model qwen3-8b

# Benchmark all installed models
metrillm bench --all

# Share your result (upload + public URL + leaderboard rank)
metrillm bench --share

# CI/non-interactive mode
metrillm bench --ci-no-menu --share

# Force unload after each model (useful for memory isolation)
metrillm bench --all --unload-after-bench

# Export results locally
metrillm bench --export json
metrillm bench --export csv

Runtime Backends

| Backend | Flag | Default URL | Environment variables |
|---|---|---|---|
| Ollama | --backend ollama | http://127.0.0.1:11434 | OLLAMA_HOST (optional) |
| LM Studio | --backend lm-studio | http://127.0.0.1:1234 | LM_STUDIO_BASE_URL, LM_STUDIO_API_KEY, LM_STUDIO_STREAM_STALL_TIMEOUT_MS (all optional) |

For very large models, tune timeout flags:

  • --perf-warmup-timeout-ms (default 300000)
  • --perf-prompt-timeout-ms (default 120000)
  • --quality-timeout-ms (default 120000)
  • --coding-timeout-ms (default 240000)
  • --lm-studio-stream-stall-timeout-ms (default 180000, 0 disables stall timeout)

How Scoring Works

Hardware Fit Score (0-100) — how well the model runs on your machine:

  • Speed: 40% (tokens/sec relative to your hardware tier)
  • TTFT: 30% (time to first token)
  • Memory: 30% (RAM efficiency)

Quality Score (0-100) — how well the model answers:

  • Reasoning: 20pts | Coding: 20pts | Instruction Following: 20pts
  • Structured Output: 15pts | Math: 15pts | Multilingual: 10pts

Global Score = 30% Hardware Fit + 70% Quality

Hardware is auto-detected and scoring adapts to your tier (Entry/Balanced/High-End). A model hitting 10 tok/s on an 8 GB machine scores differently than on a 64 GB rig.
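
The weighting above can be sketched as a tiny script. The subscore inputs here are invented for illustration — the real CLI derives them from measurements and adapts them to your hardware tier before combining:

```shell
# Illustrative subscores, each on a 0-100 scale (made-up values)
speed=70; ttft=80; memory=90   # hardware-fit components
quality=75                     # overall quality score

# Hardware fit = 40% speed + 30% TTFT + 30% memory
hw_fit=$(awk -v s="$speed" -v t="$ttft" -v m="$memory" \
  'BEGIN { printf "%.1f", 0.40*s + 0.30*t + 0.30*m }')

# Global = 30% hardware fit + 70% quality
global=$(awk -v h="$hw_fit" -v q="$quality" \
  'BEGIN { printf "%.1f", 0.30*h + 0.70*q }')

echo "hardware fit: $hw_fit  global: $global"
```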

Full methodology →

Share Your Results

Every benchmark you share enriches the public leaderboard. No account needed — pick the method that fits your workflow:

| Method | Command / Action | Best for |
|---|---|---|
| CLI | metrillm bench --share | Terminal users |
| MCP | Call the share_result tool | AI coding assistants |
| Plugin | /benchmark skill with share option | Claude Code / Cursor |

All methods produce the same result:

  • A public URL for your benchmark
  • Your rank: "Top X% globally, Top Y% on [your CPU]"
  • A share card for social media
  • A challenge link to send to friends

Compare your results on the leaderboard →

MCP Server

Use MetriLLM from Claude Code, Cursor, Windsurf, or any MCP client — no CLI needed.

# Claude Code
claude mcp add metrillm -- npx metrillm-mcp@latest

# Claude Desktop / Cursor / Windsurf — add to MCP config:
# { "command": "npx", "args": ["metrillm-mcp@latest"] }
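
For Claude Desktop and similar clients, that one-line hint typically expands to an entry like this in the client's MCP config file (the server key "metrillm" is an arbitrary name of your choosing):

```json
{
  "mcpServers": {
    "metrillm": {
      "command": "npx",
      "args": ["metrillm-mcp@latest"]
    }
  }
}
```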
| Tool | Description |
|---|---|
| list_models | List locally available LLM models |
| run_benchmark | Run full benchmark (performance + quality) on a model |
| get_results | Retrieve previous benchmark results |
| share_result | Upload a result to the public leaderboard |

Full MCP documentation →

Skills

Slash commands that work inside AI coding assistants — no server needed, just a Markdown file.

| Skill | Trigger | Description |
|---|---|---|
| /benchmark | User-invoked | Run a full benchmark interactively |
| metrillm-guide | Auto-invoked | Contextual guidance on model selection and results |

Skills are included in the plugins below, or can be installed standalone:

# Claude Code
cp -r plugins/claude-code/skills/* ~/.claude/skills/

# Cursor
cp -r plugins/cursor/skills/* ~/.cursor/skills/

Plugins

Pre-built bundles (MCP + skills + agents) for deeper IDE integration.

| Component | Description |
|---|---|
| MCP config | Auto-connects to the metrillm-mcp server |
| Skills | /benchmark + metrillm-guide |
| Agent | benchmark-advisor — analyzes your hardware and recommends models |

Install:

# Claude Code
cp -r plugins/claude-code/.claude/* ~/.claude/

# Cursor
cp -r plugins/cursor/.cursor/* ~/.cursor/

See Claude Code plugin and Cursor plugin for details.

Integrations

| Integration | Package | Status | Docs |
|---|---|---|---|
| CLI | metrillm | Stable | Usage |
| MCP Server | metrillm-mcp | Stable | MCP docs |
| Skills |  | Stable | Skills |
| Claude Code plugin |  | Stable | Plugin docs |
| Cursor plugin |  | Stable | Plugin docs |

Development

npm ci
npm run ci:verify     # typecheck + tests + build
npm run dev           # run from source
npm run test:watch    # vitest watch mode

Homebrew Formula Maintenance

The tap formula lives in Formula/metrillm.rb.

# Refresh Formula/metrillm.rb with latest npm tarball + sha256
./scripts/update-homebrew-formula.sh

# Or pin a specific version
./scripts/update-homebrew-formula.sh 0.1.1

After updating the formula, commit and push so users can install/update with:

brew tap MetriLLM/metrillm
brew install metrillm
brew upgrade metrillm

Contributing

Contributions are welcome! Please read the Contributing Guide before submitting a pull request. All commits must include a DCO sign-off.

License

Apache License 2.0 — see NOTICE for trademark information.
