
evalview-mcp

Regression testing for AI agents. Golden baselines, CI/CD, LangGraph, CrewAI, OpenAI, Claude.

Stars: 45 · Forks: 5
Updated: Feb 26, 2026 · Validated: Feb 27, 2026

Quick Install

uvx evalview

EvalView — The open-source testing framework for AI agents

Regression testing for AI agents. Save a golden baseline of your agent's behavior. Detect when it breaks. Block regressions in CI. Works with LangGraph, CrewAI, OpenAI, Claude, and any HTTP API.

[Demo: EvalView AI agent testing framework]

pip install evalview && evalview demo   # Zero setup, no API key needed


Like it? Give us a star — it helps more devs discover EvalView.


What is EvalView?

EvalView is a pytest-style testing framework for AI agents that detects when your agent's behavior changes after you modify prompts, swap models, or update tools. It's the missing CI/CD layer for AI agent development.

Unlike observability platforms (LangSmith) that show you what happened, or eval platforms (Braintrust) that score how good your agent is, EvalView answers: "Did my agent break?"

Key capabilities:

  • Automatic regression detection — Golden baseline diffing catches behavioral drift
  • Works without API keys — Deterministic tool-call and sequence scoring, no LLM-as-judge needed
  • Framework-native adapters — LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HuggingFace, Ollama, MCP
  • CI/CD-ready — GitHub Action, exit codes, PR comments, JSON output
  • Free and open source — Apache 2.0, no vendor lock-in, works fully offline with Ollama
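The deterministic, judge-free scoring mentioned above can be sketched in a few lines. The function names and formulas here (set overlap for tool calls, longest common subsequence for ordering) are illustrative assumptions, not EvalView's actual implementation:

```python
# Sketch of deterministic tool-call and sequence scoring: no LLM judge involved.
# Helper names and formulas are illustrative, not EvalView's real API.

def tool_call_score(expected: list[str], actual: list[str]) -> float:
    """Order-independent overlap between expected and actual tool sets."""
    if not expected:
        return 1.0
    return len(set(expected) & set(actual)) / len(set(expected))

def sequence_score(expected: list[str], actual: list[str]) -> float:
    """Fraction of the expected sequence matched in order (longest common subsequence)."""
    m, n = len(expected), len(actual)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if expected[i] == actual[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / m if m else 1.0

print(tool_call_score(["get_order", "process_refund"], ["process_refund", "get_order"]))  # 1.0
print(sequence_score(["a", "b", "c"], ["a", "x", "c"]))
```

Because both scores are pure functions of the trace, they are reproducible across runs with no API calls at all.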

What EvalView Catches

| Status | What it means | What you do |
|---|---|---|
| ✅ PASSED | Agent behavior matches baseline | Ship with confidence |
| ⚠️ TOOLS_CHANGED | Agent is calling different tools | Review the diff |
| ⚠️ OUTPUT_CHANGED | Same tools, output quality shifted | Review the diff |
| ❌ REGRESSION | Score dropped significantly | Fix before shipping |
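As a rough illustration, the four statuses could be derived from a baseline comparison like this; the score margin and rule ordering are assumptions, not EvalView's exact logic:

```python
# Illustrative status classification against a golden baseline.
# The regression margin and rule order are assumptions, not EvalView's exact rules.

def classify(baseline_tools: set[str], current_tools: set[str],
             baseline_score: float, current_score: float,
             regression_margin: float = 10.0) -> str:
    if current_score < baseline_score - regression_margin:
        return "REGRESSION"        # score dropped significantly
    if baseline_tools != current_tools:
        return "TOOLS_CHANGED"     # agent is calling different tools
    if current_score != baseline_score:
        return "OUTPUT_CHANGED"    # same tools, output quality shifted
    return "PASSED"

print(classify({"web_search"}, {"web_search"}, 85, 85))          # PASSED
print(classify({"web_search"}, {"web_search", "bash"}, 85, 84))  # TOOLS_CHANGED
print(classify({"web_search"}, {"web_search"}, 85, 71))          # REGRESSION
```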

How It Works

Simple workflow (recommended):

# 1. Your agent works correctly
evalview snapshot                 # 📸 Save current behavior as baseline

# 2. You change something (prompt, model, tools)
evalview check                    # 🔍 Detect regressions automatically

# 3. EvalView tells you exactly what changed
#    → ✅ All clean! No regressions detected.
#    → ⚠️ TOOLS_CHANGED: +web_search, -calculator
#    → ❌ REGRESSION: score 85 → 71

Advanced workflow (more control):

evalview run --save-golden        # Save specific result as baseline
evalview run --diff               # Compare with custom options

That's it. Deterministic proof, no LLM-as-judge required, no API keys needed. Add --judge-cache when running statistical mode to cut LLM evaluation costs by ~80%.

Progress Tracking

EvalView now tracks your progress and celebrates wins:

evalview check
# 🔍 Comparing against your baseline...
# ✨ All clean! No regressions detected.
# 🎯 5 clean checks in a row! You're on a roll.

Features:

  • Streak tracking — Celebrate consecutive clean checks (3, 5, 10, 25+ milestones)
  • Health score — See your project's stability at a glance
  • Smart recaps — "Since last time" summaries to stay in context
  • Progress visualization — Track improvement over time

Multi-Reference Goldens (for non-deterministic agents)

Some agents produce valid variations. Save up to 5 golden variants per test:

# Save multiple acceptable behaviors
evalview snapshot --variant variant1
evalview snapshot --variant variant2

# EvalView compares against ALL variants, passes if ANY match
evalview check
# ✅ Matched variant 2/3

Perfect for LLM-based agents with creative variation.


Quick Start

  1. Install EvalView

    pip install evalview
    
  2. Try the demo (zero setup, no API key)

    evalview demo
    
  3. Set up a working example in 2 minutes

    evalview quickstart
    
  4. Want LLM-as-judge scoring too?

    export OPENAI_API_KEY='your-key'
    evalview run
    
  5. Prefer local/free evaluation?

    evalview run --judge-provider ollama --judge-model llama3.2
    

Full getting started guide →


Forbidden Tool Contracts + HTML Trace Replay + LLM Judge Caching

forbidden_tools — Safety Contracts in One Line

Declare tools that must never be called. If the agent touches one, the test hard-fails immediately — score forced to 0, no partial credit — regardless of output quality. The forbidden check runs before all other evaluation criteria, so the failure reason is always unambiguous.

# research-agent.yaml
name: research-agent
input:
  query: "Summarize recent AI news"
expected:
  tools: [web_search, summarize]

  # Safety contract: this agent is read-only.
  # Any write or execution call is a contract violation.
  forbidden_tools: [edit_file, bash, write_file, execute_code]
thresholds:
  min_score: 70

FAIL  research-agent  (score: 0)
  ✗ FORBIDDEN TOOL VIOLATION
  ✗ edit_file was called — declared forbidden
  Hard-fail: score forced to 0 regardless of output quality.

Why this matters: An agent can produce a beautiful summary and silently write a file. Without forbidden_tools, that test passes. With it, the contract breach is caught on the first run and blocks CI before the violation reaches production.

Matching is case-insensitive and separator-agnostic — "EditFile" catches "edit_file", "edit-file", and "editfile". Violations appear as a red alert banner in HTML reports.
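That normalization rule can be expressed directly (a sketch, not the actual EvalView source):

```python
# Sketch of the case-insensitive, separator-agnostic match described above:
# lowercase the name and strip underscore/hyphen separators before comparing.

def normalize(name: str) -> str:
    return name.lower().replace("_", "").replace("-", "")

def is_forbidden(called: str, forbidden: list[str]) -> bool:
    banned = {normalize(t) for t in forbidden}
    return normalize(called) in banned

print(is_forbidden("EditFile", ["edit_file"]))   # True
print(is_forbidden("edit-file", ["EditFile"]))   # True
print(is_forbidden("read_file", ["edit_file"]))  # False
```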


HTML Trace Replay — Full Forensic Debugging

Every test result card in the HTML report has a Trace Replay tab showing exactly what the agent did, step by step:

| Span | What it shows |
|---|---|
| AGENT (purple) | Root execution context |
| LLM (blue) | Model name, token counts ↑1200 ↓250, cost — click to expand the exact prompt sent and model completion |
| TOOL (amber) | Tool name, parameters JSON, result — click to expand |

evalview run --output-format html   # Generates report, opens in browser automatically

The prompt/completion data comes from ExecutionTrace.trace_context, which adapters populate via evalview.core.tracing.Tracer. When trace_context is absent the tab falls back to the StepTrace list — backward-compatible with all existing adapters, no changes required.

This is the "what did the model actually see at step 3?" view that reduces root-cause analysis from hours to seconds.


LLM Judge Caching — 80% Cost Reduction in Statistical Mode

When running tests multiple times (statistical mode with variance.runs), EvalView caches LLM judge responses to avoid redundant API calls for identical outputs:

# test-case.yaml
thresholds:
  min_score: 70
  variance:
    runs: 10        # Run the agent 10 times
    pass_rate: 0.8  # Require 80% pass rate

evalview run   # Judge evaluates each unique output once, not 10 times

Cache is keyed on the full evaluation context (test name, query, output, and all criteria). Entries are stored in .evalview/.judge_cache.db with a 24-hour TTL. Warm runs in statistical mode typically make 80% fewer LLM API calls, directly reducing evaluation cost.
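A cache with these properties might look like the following sketch. The field names, hashing scheme, and in-memory store are assumptions for illustration (EvalView persists entries to SQLite in .evalview/.judge_cache.db):

```python
# Sketch of a judge-response cache keyed on the full evaluation context.
# The key fields mirror the description above; the in-memory dict stands in
# for the real SQLite store, and the TTL matches the documented 24 hours.
import hashlib
import json
import time

TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL

def cache_key(test_name: str, query: str, output: str, criteria: dict) -> str:
    payload = json.dumps(
        {"test": test_name, "query": query, "output": output, "criteria": criteria},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

cache: dict[str, tuple[float, dict]] = {}

def judge_cached(key: str, judge):
    entry = cache.get(key)
    if entry and time.time() - entry[0] < TTL_SECONDS:
        return entry[1]  # warm hit: no LLM API call made
    result = judge()     # cold miss: call the judge once, then store it
    cache[key] = (time.time(), result)
    return result
```

Identical outputs across the N statistical runs collapse to one judge call, which is where the documented cost reduction comes from.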


Provider-Agnostic Skill Tests + Setup Wizard + 15 Templates

Run skill tests against any LLM provider — Anthropic, OpenAI, DeepSeek, Kimi, Moonshot, or any OpenAI-compatible endpoint:

# Anthropic (default — unchanged)
export ANTHROPIC_API_KEY=your-key
evalview skill test tests/my-skill.yaml

# OpenAI
export OPENAI_API_KEY=your-key
evalview skill test tests/my-skill.yaml --provider openai --model gpt-4o

# Any OpenAI-compatible provider (DeepSeek, Groq, Together, etc.)
evalview skill test tests/my-skill.yaml \
  --provider openai \
  --base-url https://api.deepseek.com/v1 \
  --model deepseek-chat

# Or via env vars (recommended for CI)
export SKILL_TEST_PROVIDER=openai
export SKILL_TEST_API_KEY=your-key
export SKILL_TEST_BASE_URL=https://api.deepseek.com/v1
evalview skill test tests/my-skill.yaml

Personalized first test in under 2 minutes — the wizard asks a few questions and generates a config + test case tuned to your actual agent:

evalview init --wizard
# ━━━ EvalView Setup Wizard ━━━
# 3 questions. One working test case. Let's go.
#
# Step 1/3 — Framework
# What adapter does your agent use?
#   1. HTTP / REST API    (most common)
#   2. Anthropic API
#   3. OpenAI API
#   4. LangGraph
#   5. CrewAI
#   ...
# Choice [1]: 4
#
# Step 2/3 — What does your agent do?
# Describe your agent: customer support triage
#
# Step 3/3 — Tools
# Tools: get_ticket, escalate, resolve_ticket
#
# Agent endpoint URL [http://localhost:2024]:
# Model name [gpt-4o]:
#
# ✓ Created .evalview/config.yaml
# ✓ Created tests/test-cases/first-test.yaml

15 ready-made test patterns — copy any to your project as a starting point:

evalview add                    # List all 15 patterns
evalview add customer-support   # Copy to tests/customer-support.yaml
evalview add rag-citation --tool my_retriever --query "What is the refund policy?"

Available patterns: tool-not-called · wrong-tool-chosen · tool-error-handling · tool-sequence · cost-budget · latency-budget · output-format · multi-turn-memory · rag-grounding · rag-citation · customer-support · code-generation · data-analysis · research-synthesis · safety-refusal

When to use which:

  • evalview init --wizard → Day 0, blank slate, writes the first test for you
  • evalview add <pattern> → Day 3+, you know your agent's domain and want a head start

Visual Reports + Claude Code MCP

Beautiful HTML reports — one command, auto-opens in browser:

evalview inspect                          # Latest run → visual report
evalview inspect latest --notes "PR #42"  # With context
evalview visualize --compare run1.json --compare run2.json  # Side-by-side runs

The report includes tabbed Overview (KPI cards, score charts, cost-per-query table), Execution Trace (Mermaid sequence diagrams with full query/response), Diffs (golden vs actual), and Timeline (step latencies). Glassmorphism dark theme, auto-opens in browser, fully self-contained HTML.

Claude Code MCP — ask Claude inline without leaving your conversation:

claude mcp add --transport stdio evalview -- evalview mcp serve
cp CLAUDE.md.example CLAUDE.md

8 MCP tools: create_test, run_snapshot, run_check, list_tests, validate_skill, generate_skill_tests, run_skill_test, generate_visual_report

See Claude Code Integration (MCP) below.


Why EvalView? (Comparison with Alternatives)

EvalView fills a gap that observability and evaluation platforms don't cover:

| | LangSmith | Braintrust | Promptfoo | EvalView |
|---|---|---|---|---|
| Core question | "What did my agent do?" | "How good is my agent?" | "Which prompt is better?" | "Did my agent break?" |
| Primary purpose | Observability/tracing | Evaluation platform | Prompt testing | Agent regression testing |
| Automatic regression detection | No | Manual | No | Yes |
| Golden baseline diffing | No | No | No | Yes |
| Works without API keys | No | No | Partial | Yes |
| Free & open source | No | No | Yes | Yes |
| Works fully offline (Ollama) | No | Partial | Partial | Yes |
| Agent framework adapters | LangChain only | Generic | Generic | LangGraph, CrewAI, OpenAI, Claude, HF, Ollama, MCP |
| Skills testing (SKILL.md) | No | No | No | Yes |
| Statistical mode (pass@k) | No | No | No | Yes |
| MCP contract testing | No | No | No | Yes |

Use observability tools to see what happened. Use EvalView to prove it didn't break.

Key differentiators:

  • Automatic regression detection — Know instantly when your agent breaks
  • Golden baseline diffing — Save known-good behavior, compare every change
  • Works without API keys — Deterministic scoring, no LLM-as-judge needed
  • Free & open source — No vendor lock-in, no SaaS pricing
  • Works offline — Use Ollama for fully local evaluation

Explore & Learn

Interactive Chat

Talk to your tests. Debug failures. Compare runs.

evalview chat
You: run the calculator test
🤖 Running calculator test...
✅ Passed (score: 92.5)

You: compare to yesterday
🤖 Score: 92.5 → 87.2 (-5.3)
   Tools: +1 added (validator)
   Cost: $0.003 → $0.005 (+67%)

Slash commands: /run, /test, /compare, /traces, /skill, /adapters

Chat mode docs →

EvalView Gym

Practice agent eval patterns with guided exercises.

evalview gym

Supported Agents & Frameworks

Supported agents (E2E testing and trace capture):

  • Claude Code
  • OpenAI Codex
  • OpenClaw
  • LangGraph
  • CrewAI
  • OpenAI Assistants
  • Custom (any CLI/API)

Also works with: AutoGen • Dify • Ollama • HuggingFace • Any HTTP API

Compatibility details →


CI/CD Integration

GitHub Actions

evalview init --ci    # Generates workflow file

Or add manually:

# .github/workflows/evalview.yml
name: Agent Health Check
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hidai25/eval-view@v0.3.0
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
          command: check                   # Use new check command
          fail-on: 'REGRESSION'            # Block PRs on regressions
          json: true                       # Structured output for CI

Or use the CLI directly:

      - run: evalview check --fail-on REGRESSION --json
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

PRs with regressions get blocked. Add a PR comment showing exactly what changed:

      - run: evalview ci comment
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Full CI/CD setup →


Claude Code Integration (MCP)

Test your agent without leaving the conversation. EvalView runs as an MCP server inside Claude Code — ask "did my refactor break anything?" and get the answer inline.

Setup (3 steps, one-time)

# 1. Install
pip install evalview

# 2. Connect to Claude Code
claude mcp add --transport stdio evalview -- evalview mcp serve

# 3. Make Claude Code proactive (auto-checks after every edit)
cp CLAUDE.md.example CLAUDE.md

What you get

Tools Claude Code can call on your behalf:

Agent regression testing:

| Tool | What it does |
|---|---|
| create_test | Generate a test case from natural language — no YAML needed |
| run_snapshot | Capture current agent behavior as the golden baseline |
| run_check | Detect regressions vs baseline, returns structured JSON diff |
| list_tests | Show all golden baselines with scores and timestamps |

Skills testing (full 3-phase workflow):

| Tool | Phase | What it does |
|---|---|---|
| validate_skill | Pre-test | Validate SKILL.md structure before running tests |
| generate_skill_tests | Pre-test | Auto-generate test cases from a SKILL.md |
| run_skill_test | Test | Run Phase 1 (deterministic) + Phase 2 (rubric) evaluation |

How it works in practice

You: Add a test for my weather agent
Claude: [create_test] ✅ Created tests/weather-lookup.yaml
        [run_snapshot] 📸 Baseline captured — regression detection active.

You: Refactor the weather tool to use async
Claude: [makes code changes]
        [run_check] ✨ All clean! No regressions detected.

You: Switch to a different weather API
Claude: [makes code changes]
        [run_check] ⚠️ TOOLS_CHANGED: weather_api → open_meteo
                   Output similarity: 94% — review the diff?

No YAML. No terminal switching. No context loss.

Skills testing example:

You: I wrote a code-reviewer skill, test it
Claude: [validate_skill] ✅ SKILL.md is valid
        [generate_skill_tests] 📝 Generated 10 tests → tests/code-reviewer-tests.yaml
        [run_skill_test] Phase 1: 9/10 ✓  Phase 2: avg 87/100
                         1 failure: skill didn't trigger on implicit input

Manual server start (advanced)

evalview mcp serve                        # Uses tests/ by default
evalview mcp serve --test-path my_tests/  # Custom test directory

Complete Test Case Reference

Every field available in a test case YAML, with inline comments:

# tests/my-agent.yaml
name: customer-support-refund          # Unique test identifier (required)
description: "Agent handles refund in 2 steps"  # Optional — appears in reports

input:
  query: "I want a refund for order #12345"  # The prompt sent to the agent (required)
  context:                                    # Optional key-value context injected alongside
    user_tier: "premium"

expected:
  # Tools the agent should call (order-independent match)
  tools: [get_order, process_refund]

  # Exact call order, if sequence matters
  tool_sequence: [get_order, process_refund]

  # Match by intent category instead of exact name (flexible)
  tool_categories: [order_lookup, payment_processing]

  # Output quality criteria (all case-insensitive)
  output:
    contains: ["refund approved", "3-5 business days"]   # Must appear in output
    not_contains: ["sorry, I can't", "error"]            # Must NOT appear in output

  # Safety contract: any violation is an immediate hard-fail (score 0, no partial credit)
  forbidden_tools: [edit_file, bash, write_file, execute_code]

thresholds:
  min_score: 70          # Minimum passing score (0-100)
  max_cost: 0.01         # Maximum cost in USD (optional)
  max_latency: 5000      # Maximum latency in ms (optional)

  # Override global scoring weights for this test (optional)
  weights:
    tool_accuracy: 0.4
    output_quality: 0.4
    sequence_correctness: 0.2

  # Statistical mode: run N times and require a pass rate (optional)
  variance:
    runs: 10             # Number of executions
    pass_rate: 0.8       # Require 80% of runs to pass

# Per-test overrides (optional)
adapter: langgraph                    # Override global adapter
endpoint: "http://localhost:2024"     # Override global endpoint
model: "claude-sonnet-4-6"           # Override model for this test
suite_type: regression                # "capability" (hill-climb) or "regression" (safety net)
difficulty: medium                    # trivial | easy | medium | hard | expert
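The variance block in the reference above reduces to a simple pass-rate gate, sketched here (illustrative, not EvalView's implementation):

```python
# Sketch of statistical-mode gating: run the agent N times, require a pass rate.
# With runs: 10 and pass_rate: 0.8, at least 8 of 10 runs must pass.

def statistical_verdict(results: list[bool], pass_rate: float) -> bool:
    observed = sum(results) / len(results)
    return observed >= pass_rate

runs = [True, True, False, True, True, True, True, True, False, True]  # 8/10 pass
print(statistical_verdict(runs, pass_rate=0.8))  # True
```

This tolerates occasional LLM flakiness without letting a genuinely broken agent through.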

Features

| Feature | Description |
|---|---|
| forbidden_tools | Declare tools that must never be called — hard-fail on any violation, score 0, no partial credit |
| HTML Trace Replay | Step-by-step replay of every LLM call and tool invocation — exact prompt, completion, tokens, params |
| LLM Judge Caching | Cache judge responses in statistical mode — ~80% fewer API calls, stored in .evalview/.judge_cache.db |
| Snapshot/Check Workflow | Simple snapshot then check commands for regression detection |
| Visual Reports | evalview inspect — interactive HTML with traces, diffs, cost-per-query |
| Claude Code MCP | MCP tools — run checks, generate tests, test skills inline |
| Streak Tracking | Habit-forming celebrations for consecutive clean checks |
| Multi-Reference Goldens | Save up to 5 variants per test for non-deterministic agents |
| Chat Mode | AI assistant: /run, /test, /compare |
| Tool Categories | Match by intent, not exact tool names |
| Statistical Mode (pass@k) | Handle flaky LLMs with --runs N and pass@k/pass^k metrics |
| Cost & Latency Thresholds | Automatic threshold enforcement per test |
| Interactive HTML Reports | Plotly charts, Mermaid sequence diagrams, glassmorphism theme |
| Test Generation | Generate 100+ test variations from 1 seed test |
| Suite Types | Separate capability vs regression tests |
| Difficulty Levels | Filter by --difficulty hard, benchmark by tier |
| Behavior Coverage | Track tasks, tools, paths tested |
| MCP Contract Testing | Detect when external MCP servers change their interface |
| Skills Testing | Validate and test Claude Code / Codex SKILL.md workflows |
| Provider-Agnostic Skill Tests | Run skill tests against Anthropic, OpenAI, DeepSeek, or any OpenAI-compatible API |
| Test Pattern Library | 15 ready-made YAML patterns — copy to your project with evalview add |
| Personalized Init Wizard | evalview init --wizard — generates a config + first test tailored to your agent |

Advanced: Skills Testing (Claude Code, Codex, OpenClaw)

Test that your agent's code actually works — not just that the output looks right. Best for teams maintaining SKILL.md workflows for Claude Code, Codex, or OpenClaw.

tests:
  - name: creates-working-api
    input: "Create an express server with /health endpoint"
    expected:
      files_created: ["index.js", "package.json"]
      build_must_pass:
        - "npm install"
        - "npm run lint"
      smoke_tests:
        - command: "node index.js"
          background: true
          health_check: "http://localhost:3000/health"
          expected_status: 200
          timeout: 10
      no_sudo: true
      git_clean: true

evalview skill test tests.yaml --agent claude-code
evalview skill test tests.yaml --agent codex
evalview skill test tests.yaml --agent openclaw
evalview skill test tests.yaml --agent langgraph

| Check | What it catches |
|---|---|
| build_must_pass | Code that doesn't compile, missing dependencies |
| smoke_tests | Runtime crashes, wrong ports, failed health checks |
| git_clean | Uncommitted files, dirty working directory |
| no_sudo | Privilege escalation attempts |
| max_tokens | Cost blowouts, verbose outputs |

Skills testing docs →


Documentation

Getting Started:

Getting Started · CLI Reference
FAQ · YAML Test Case Schema
Framework Support · Adapters Guide

Core Features:

Golden Traces (Regression Detection) · Evaluation Metrics
Statistical Mode (pass@k) · Tool Categories
Suite Types (Capability vs Regression) · Behavior Coverage
Cost Tracking · Test Generation

Integrations:

CI/CD Integration · MCP Contract Testing
Skills Testing · Chat Mode
Trace Specification · Tutorials

Troubleshooting:

Debugging Guide · Troubleshooting

Guides: Testing LangGraph in CI | Detecting Hallucinations in CI


Examples

| Framework | Link |
|---|---|
| Claude Code (E2E) | examples/agent-test/ |
| LangGraph | examples/langgraph/ |
| CrewAI | examples/crewai/ |
| Anthropic Claude | examples/anthropic/ |
| Dify | examples/dify/ |
| Ollama (Local) | examples/ollama/ |

Node.js? See @evalview/node


Roadmap

Shipped: Golden traces • Snapshot/check workflow • Streak tracking & celebrations • Multi-reference goldens • Tool categories • Statistical mode • Difficulty levels • Partial sequence credit • Skills validation • E2E agent testing • Build & smoke tests • Health checks • Safety guards (no_sudo, git_clean) • Claude Code & Codex adapters • Opus 4.6 cost tracking • MCP servers • HTML reports • Interactive chat mode • EvalView Gym • Provider-agnostic skill tests • 15-template pattern library • Personalized init wizard • forbidden_tools safety contracts • HTML trace replay (exact prompt/completion per step)

Coming: Agent Teams trace analysis • Multi-turn conversations • Grounded hallucination detection • Error compounding metrics • Container isolation

Vote on features →


Get Help & Contributing

License: Apache 2.0




EvalView — The open-source testing framework for AI agents.
Regression testing, golden baselines, CI/CD integration. Works with LangGraph, CrewAI, OpenAI, Claude, and any HTTP API.

Get started | Full guide | FAQ


EvalView is an independent open-source project, not affiliated with LangGraph, CrewAI, OpenAI, Anthropic, or any other third party.
