MetriLLM

Benchmark your local LLM models in one command. Speed, quality, hardware fitness — with a shareable score and public leaderboard.

Think Geekbench, but for local LLMs on your actual hardware.

npx metrillm@latest bench

MetriLLM CLI — interactive menu MetriLLM CLI — hardware detection

What You Get

Performance metrics: tokens/sec, time to first token, memory usage, load time
Quality evaluation: reasoning, coding, math, instruction following, structured output, multilingual (14 prompts, 6 categories)
Global score (0-100): 30% hardware fit + 70% quality
Verdict: EXCELLENT / GOOD / MARGINAL / NOT RECOMMENDED
One-click share: --share uploads your result and gives you a public URL + leaderboard rank

Real Benchmark Results

From the public leaderboard — all results below were submitted with metrillm bench --share.

Model	Machine	CPU	RAM	tok/s	TTFT	Global	Verdict
llama3.2:latest	Mac Mini	Apple M4 Pro	64 GB	98.9	125 ms	77	GOOD
mistral:latest	Mac Mini	Apple M4 Pro	64 GB	54.3	124 ms	76	GOOD
gemma3:4b	MacBook Air	Apple M4	32 GB	35.9	303 ms	72	GOOD
gemma3:1b	MacBook Air	Apple M4	32 GB	39.4	362 ms	72	GOOD
qwen3:1.7b	MacBook Air	Apple M4	32 GB	37.9	3.1 s	70	GOOD
llama3.2:3b	MacBook Air	Apple M4	32 GB	27.8	285 ms	69	GOOD
gemma3:12b	MacBook Air	Apple M4	32 GB	12.3	656 ms	67	GOOD
phi4:14b	MacBook Air	Apple M4	32 GB	11.1	515 ms	65	GOOD
mistral:7b	MacBook Air	Apple M4	32 GB	13.6	517 ms	61	GOOD
deepseek-r1:14b	MacBook Air	Apple M4	32 GB	10.8	30.0 s	25	NOT RECOMMENDED

Key takeaway: Small models (1-4B) fly on Apple Silicon. Larger models (14B+) with thinking chains can choke even on capable hardware. See full leaderboard →

Install

Requires Node 20+ and a local runtime: Ollama or LM Studio.

# Run directly (no install)
npx metrillm@latest bench

# Or install globally
npm i -g metrillm
metrillm bench

# Homebrew (no global npm install)
# One-liner install (without pre-tapping):
brew install MetriLLM/metrillm/metrillm

# Or one-time tap for short install command:
brew tap MetriLLM/metrillm
# Then:
brew install metrillm
metrillm bench

# Alternative package managers
pnpm dlx metrillm@latest bench
bunx metrillm@latest bench

Usage

# Interactive mode — pick models from a menu
metrillm bench

# Benchmark a specific model
metrillm bench --model gemma3:4b

# Benchmark with LM Studio backend
metrillm bench --backend lm-studio --model qwen3-8b

# Benchmark all installed models
metrillm bench --all

# Share your result (upload + public URL + leaderboard rank)
metrillm bench --share

# CI/non-interactive mode
metrillm bench --ci-no-menu --share

# Force unload after each model (useful for memory isolation)
metrillm bench --all --unload-after-bench

# Export results locally
metrillm bench --export json
metrillm bench --export csv

Runtime Backends

Backend	Flag	Default URL	Required env
Ollama	`--backend ollama`	`http://127.0.0.1:11434`	`OLLAMA_HOST` (optional)
LM Studio	`--backend lm-studio`	`http://127.0.0.1:1234`	`LM_STUDIO_BASE_URL` (optional), `LM_STUDIO_API_KEY` (optional), `LM_STUDIO_STREAM_STALL_TIMEOUT_MS` (optional)

For very large models, tune timeout flags:

--perf-warmup-timeout-ms (default 300000)
--perf-prompt-timeout-ms (default 120000)
--quality-timeout-ms (default 120000)
--coding-timeout-ms (default 240000)
--lm-studio-stream-stall-timeout-ms (default 180000, 0 disables stall timeout)

How Scoring Works

Hardware Fit Score (0-100) — how well the model runs on your machine:

Speed: 40% (tokens/sec relative to your hardware tier)
TTFT: 30% (time to first token)
Memory: 30% (RAM efficiency)

Quality Score (0-100) — how well the model answers:

Reasoning: 20pts | Coding: 20pts | Instruction Following: 20pts
Structured Output: 15pts | Math: 15pts | Multilingual: 10pts

Global Score = 30% Hardware Fit + 70% Quality

Hardware is auto-detected and scoring adapts to your tier (Entry/Balanced/High-End). A model hitting 10 tok/s on a 8GB machine scores differently than on a 64GB rig.

Full methodology →

Share Your Results

Every benchmark you share enriches the public leaderboard. No account needed — pick the method that fits your workflow:

Method	Command / Action	Best for
CLI	`metrillm bench --share`	Terminal users
MCP	Call `share_result` tool	AI coding assistants
Plugin	`/benchmark` skill with share option	Claude Code / Cursor

All methods produce the same result:

A public URL for your benchmark
Your rank: "Top X% globally, Top Y% on [your CPU]"
A share card for social media
A challenge link to send to friends

Compare your results on the leaderboard →

MCP Server

Use MetriLLM from Claude Code, Cursor, Windsurf, or any MCP client — no CLI needed.

# Claude Code
claude mcp add metrillm -- npx metrillm-mcp@latest

# Claude Desktop / Cursor / Windsurf — add to MCP config:
# { "command": "npx", "args": ["metrillm-mcp@latest"] }

Tool	Description
`list_models`	List locally available LLM models
`run_benchmark`	Run full benchmark (performance + quality) on a model
`get_results`	Retrieve previous benchmark results
`share_result`	Upload a result to the public leaderboard

Full MCP documentation →

Skills

Slash commands that work inside AI coding assistants — no server needed, just a Markdown file.

Skill	Trigger	Description
`/benchmark`	User-invoked	Run a full benchmark interactively
`metrillm-guide`	Auto-invoked	Contextual guidance on model selection and results

Skills are included in the plugins below, or can be installed standalone:

# Claude Code
cp -r plugins/claude-code/skills/* ~/.claude/skills/

# Cursor
cp -r plugins/cursor/skills/* ~/.cursor/skills/

Plugins

Pre-built bundles (MCP + skills + agents) for deeper IDE integration.

Component	Description
MCP config	Auto-connects to `metrillm-mcp` server
Skills	`/benchmark` + `metrillm-guide`
Agent	`benchmark-advisor` — analyzes your hardware and recommends models

Install:

# Claude Code
cp -r plugins/claude-code/.claude/* ~/.claude/

# Cursor
cp -r plugins/cursor/.cursor/* ~/.cursor/

See Claude Code plugin and Cursor plugin for details.

Integrations

Integration	Package	Status	Docs
CLI	`metrillm`	Stable	Usage
MCP Server	`metrillm-mcp`	Stable	MCP docs
Skills	—	Stable	Skills
Claude Code plugin	—	Stable	Plugin docs
Cursor plugin	—	Stable	Plugin docs

Development

npm ci
npm run ci:verify     # typecheck + tests + build
npm run dev           # run from source
npm run test:watch    # vitest watch mode

Homebrew Formula Maintenance

The tap formula lives in Formula/metrillm.rb.

# Refresh Formula/metrillm.rb with latest npm tarball + sha256
./scripts/update-homebrew-formula.sh

# Or pin a specific version
./scripts/update-homebrew-formula.sh 0.1.1

After updating the formula, commit and push so users can install/update with:

brew tap MetriLLM/metrillm
brew install metrillm
brew upgrade metrillm

Contributing

Contributions are welcome! Please read the Contributing Guide before submitting a pull request. All commits must include a DCO sign-off.

License

Apache License 2.0 — see NOTICE for trademark information.

metrillm

Quick Install