
RAGScore

Generate QA datasets & evaluate RAG systems. Privacy-first, any LLM, local or cloud.

Stars: 28 · Forks: 4 · Updated: Feb 11, 2026 · Validated: Feb 13, 2026

Quick Install

uvx ragscore

Generate QA datasets & evaluate RAG systems in 2 commands

🔒 Privacy-First • ⚡ Lightning Fast • 🤖 Any LLM • 🏠 Local or Cloud

English | 中文 | 日本語


⚡ 2-Line RAG Evaluation

# Step 1: Generate QA pairs from your docs
ragscore generate docs/

# Step 2: Evaluate your RAG system
ragscore evaluate http://localhost:8000/query

That's it. Get accuracy scores and incorrect QA pairs instantly.

============================================================
✅ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================

❌ 15 Incorrect Pairs:

  1. Q: "What is RAG?"
     Score: 2/5 - Factually incorrect

  2. Q: "How does retrieval work?"
     Score: 3/5 - Incomplete answer

🚀 Quick Start

Install

pip install ragscore              # Core (works with Ollama)
pip install "ragscore[openai]"    # + OpenAI support
pip install "ragscore[notebook]"  # + Jupyter/Colab support
pip install "ragscore[all]"       # + All providers

Option 1: Python API (Notebook-Friendly)

Perfect for Jupyter, Colab, and rapid iteration. Get instant visualizations.

from ragscore import quick_test

# 1. Audit your RAG in one line
result = quick_test(
    endpoint="http://localhost:8000/query",  # Your RAG API
    docs="docs/",                            # Your documents
    n=10,                                    # Number of test questions
)

# 2. See the report
result.plot()

# 3. Inspect failures
bad_rows = result.df[result.df['score'] < 3]
display(bad_rows[['question', 'rag_answer', 'reason']])

Rich Object API:

  • result.accuracy - Accuracy score
  • result.df - Pandas DataFrame of all results
  • result.plot() - 3-panel visualization
  • result.corrections - List of items to fix

Option 2: CLI (Production)

Generate QA Pairs

# Set API key (or use local Ollama - no key needed!)
export OPENAI_API_KEY="sk-..."

# Generate from any document
ragscore generate paper.pdf
ragscore generate docs/*.pdf --concurrency 10

Evaluate Your RAG

# Point to your RAG endpoint
ragscore evaluate http://localhost:8000/query

# Custom options
ragscore evaluate http://api/ask --model gpt-4o --output results.json

🏠 100% Private with Local LLMs

# Use Ollama - no API keys, no cloud, 100% private
ollama pull llama3.1
ragscore generate confidential_docs/*.pdf
ragscore evaluate http://localhost:8000/query

Perfect for: Healthcare 🏥 • Legal ⚖️ • Finance 🏦 • Research 🔬

Ollama Model Recommendations

RAGScore generates complex structured QA pairs (question + answer + rationale + support span) in JSON format. This requires models with strong instruction-following and JSON output capabilities.

| Model | Size | Min RAM | QA Quality | Recommended |
|---|---|---|---|---|
| llama3.1:70b | 40GB | 48GB VRAM | Excellent | GPU server (A100, L40) |
| qwen2.5:32b | 18GB | 24GB VRAM | Excellent | GPU server (A10, L20) |
| llama3.1:8b | 4.7GB | 8GB VRAM | Good | Best local choice |
| qwen2.5:7b | 4.4GB | 8GB VRAM | Good | Good local alternative |
| mistral:7b | 4.1GB | 8GB VRAM | Good | Good local alternative |
| llama3.2:3b | 2.0GB | 4GB RAM | Fair | CPU-only / testing |
| qwen2.5:1.5b | 1.0GB | 2GB RAM | Poor | Not recommended |

Minimum recommended: 8B+ models. Smaller models (1.5B–3B) produce lower-quality support spans and may time out on longer chunks.
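Since weaker models often emit truncated or malformed JSON, it can help to sanity-check generated lines before trusting them. A minimal standard-library sketch (validate_qa is an illustrative helper, not part of RAGScore; the required fields follow the output schema shown below):

```python
import json

# Core fields every generated QA pair should carry, per the JSONL schema.
REQUIRED_KEYS = {"question", "answer", "rationale", "support_span"}

def validate_qa(line: str) -> bool:
    """Return True if a line is valid JSON containing the expected QA fields."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and REQUIRED_KEYS <= record.keys()

good = '{"question": "What is RAG?", "answer": "...", "rationale": "...", "support_span": "..."}'
bad = '{"question": "What is RAG?"'  # truncated JSON, as a weak model might emit
print(validate_qa(good), validate_qa(bad))  # True False
```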

Ollama Performance Guide

# Recommended: 8B model with concurrency 2 for local machines
ollama pull llama3.1:8b
ragscore generate docs/ --provider ollama --model llama3.1:8b

# GPU server (A10/L20): larger model with higher concurrency
ollama pull qwen2.5:32b
ragscore generate docs/ --provider ollama --model qwen2.5:32b --concurrency 5

Expected performance (28 chunks, 5 QA pairs per chunk):

| Hardware | Model | Time | Concurrency |
|---|---|---|---|
| MacBook (CPU) | llama3.2:3b | ~45 min | 2 |
| MacBook (CPU) | llama3.1:8b | ~25 min | 2 |
| A10 (24GB) | llama3.1:8b | ~3–5 min | 5 |
| L20/L40 (48GB) | qwen2.5:32b | ~3–5 min | 5 |
| OpenAI API | gpt-4o-mini | ~2 min | 10 |

RAGScore auto-reduces concurrency to 2 for local Ollama to avoid GPU/CPU contention.
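The defaulting behavior can be sketched as below. This is illustrative, not RAGScore's internal code; it assumes (consistent with the GPU-server example above, which passes --concurrency 5) that an explicit flag overrides the conservative local default:

```python
def effective_concurrency(provider, requested=None, cloud_default=10, local_default=2):
    """Sketch: local Ollama defaults to low concurrency to avoid GPU/CPU
    contention, unless the user passes --concurrency explicitly."""
    if requested is not None:   # explicit --concurrency always wins
        return requested
    return local_default if provider == "ollama" else cloud_default

print(effective_concurrency("ollama"))               # 2
print(effective_concurrency("ollama", requested=5))  # 5
print(effective_concurrency("openai"))               # 10
```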


🔌 Supported LLMs

| Provider | Setup | Notes |
|---|---|---|
| Ollama | ollama serve | Local, free, private |
| OpenAI | export OPENAI_API_KEY="sk-..." | Best quality |
| Anthropic | export ANTHROPIC_API_KEY="..." | Long context |
| DashScope | export DASHSCOPE_API_KEY="..." | Qwen models |
| vLLM | export LLM_BASE_URL="..." | Production-grade |
| Any OpenAI-compatible | export LLM_BASE_URL="..." | Groq, Together, etc. |

📊 Output Formats

Generated QA Pairs (output/generated_qas.jsonl)

{
  "id": "abc123",
  "question": "What is RAG?",
  "answer": "RAG (Retrieval-Augmented Generation) combines...",
  "rationale": "This is explicitly stated in the introduction...",
  "support_span": "RAG systems retrieve relevant documents...",
  "difficulty": "medium",
  "source_path": "docs/rag_intro.pdf"
}
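Because the output is JSONL (one JSON object per line), the standard library is enough to post-process it. A small sketch using abbreviated records in the shape shown above:

```python
import json

# Two abbreviated records in the generated_qas.jsonl shape.
sample_jsonl = "\n".join([
    '{"id": "a1", "question": "What is RAG?", "answer": "...", "difficulty": "medium", "source_path": "docs/rag_intro.pdf"}',
    '{"id": "a2", "question": "How does retrieval work?", "answer": "...", "difficulty": "hard", "source_path": "docs/rag_intro.pdf"}',
])

qas = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]
hard = [qa["question"] for qa in qas if qa["difficulty"] == "hard"]
print(len(qas), hard)  # 2 ['How does retrieval work?']
```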

Evaluation Results (--output results.json)

{
  "summary": {
    "total": 100,
    "correct": 85,
    "incorrect": 15,
    "accuracy": 0.85,
    "avg_score": 4.2
  },
  "incorrect_pairs": [
    {
      "question": "What is RAG?",
      "golden_answer": "RAG combines retrieval with generation...",
      "rag_answer": "RAG is a database system.",
      "score": 2,
      "reason": "Factually incorrect - RAG is not a database"
    }
  ]
}
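The results file is plain JSON, so triaging failures programmatically takes only a few lines. A sketch over an abbreviated copy of the sample above:

```python
import json

report = json.loads("""
{
  "summary": {"total": 100, "correct": 85, "incorrect": 15, "accuracy": 0.85, "avg_score": 4.2},
  "incorrect_pairs": [
    {"question": "What is RAG?", "score": 2, "reason": "Factually incorrect - RAG is not a database"}
  ]
}
""")

summary = report["summary"]
# Pull out the lowest-scoring pairs for manual review first.
worst = [p["question"] for p in report["incorrect_pairs"] if p["score"] <= 2]
print(summary["accuracy"], worst)  # 0.85 ['What is RAG?']
```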

🧪 Python API

from ragscore import run_pipeline, run_evaluation

# Generate QA pairs
run_pipeline(paths=["docs/"], concurrency=10)

# Evaluate RAG
results = run_evaluation(
    endpoint="http://localhost:8000/query",
    model="gpt-4o",  # LLM for judging
)
print(f"Accuracy: {results.accuracy:.1%}")

🤖 AI Agent Integration

RAGScore is designed for AI agents and automation:

# Structured CLI with predictable output
ragscore generate docs/ --concurrency 5
ragscore evaluate http://api/query --output results.json

# Exit codes: 0 = success, 1 = error
# JSON output for programmatic parsing
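An agent can drive the CLI via subprocess. A minimal sketch that builds the invocation from the flags documented above (evaluate_cmd is an illustrative helper; actually running it requires ragscore installed and a live RAG endpoint):

```python
from typing import List, Optional

def evaluate_cmd(endpoint: str, output: str = "results.json",
                 model: Optional[str] = None) -> List[str]:
    """Build a ragscore evaluate invocation from the documented flags."""
    cmd = ["ragscore", "evaluate", endpoint, "--output", output]
    if model:
        cmd += ["--model", model]
    return cmd

cmd = evaluate_cmd("http://localhost:8000/query", model="gpt-4o")
print(cmd)
# To actually run it:
# import subprocess
# ok = subprocess.run(cmd).returncode == 0   # exit code 0 = success
```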

CLI Reference:

| Command | Description |
|---|---|
| ragscore generate <paths> | Generate QA pairs from documents |
| ragscore evaluate <endpoint> | Evaluate RAG against golden QAs |
| ragscore --help | Show all commands and options |
| ragscore generate --help | Show generate options |
| ragscore evaluate --help | Show evaluate options |

⚙️ Configuration

Zero config required. Optional environment variables:

export RAGSCORE_CHUNK_SIZE=512          # Chunk size for documents
export RAGSCORE_QUESTIONS_PER_CHUNK=5   # QAs per chunk
export RAGSCORE_WORK_DIR=/path/to/dir   # Working directory
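For reference, this is how such variables are typically read, using the example values above as stand-in defaults (this is not RAGScore's internal code, and the real defaults may differ):

```python
import os

# Illustrative only: read each setting from the environment, falling back
# to the example values when the variable is unset.
chunk_size = int(os.environ.get("RAGSCORE_CHUNK_SIZE", "512"))
questions_per_chunk = int(os.environ.get("RAGSCORE_QUESTIONS_PER_CHUNK", "5"))
work_dir = os.environ.get("RAGSCORE_WORK_DIR", os.getcwd())
print(chunk_size, questions_per_chunk)  # 512 5 (when the variables are unset)
```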

🔐 Privacy & Security

| Data | Cloud LLM | Local LLM |
|---|---|---|
| Documents | ✅ Local | ✅ Local |
| Text chunks | ⚠️ Sent to LLM | ✅ Local |
| Generated QAs | ✅ Local | ✅ Local |
| Evaluation results | ✅ Local | ✅ Local |

Compliance: GDPR ✅ • HIPAA ✅ (with local LLMs) • SOC 2 ✅


🧪 Development

git clone https://github.com/HZYAI/RagScore.git
cd RagScore
pip install -e ".[dev,all]"
pytest

⭐ Star us on GitHub if RAGScore helps you!
Made with ❤️ for the RAG community
