
RAGScore

Generate QA datasets & evaluate RAG systems. Privacy-first, any LLM, local or cloud.

Stars: 28 · Forks: 4 · Updated: Feb 11, 2026 · Validated: Feb 13, 2026

Quick Install

uvx ragscore

Generate QA datasets & evaluate RAG systems in 2 commands

🔒 Privacy-First • ⚡ Lightning Fast • 🤖 Any LLM • 🏠 Local or Cloud

English | 中文 | 日本語


⚡ 2-Line RAG Evaluation

# Step 1: Generate QA pairs from your docs
ragscore generate docs/

# Step 2: Evaluate your RAG system
ragscore evaluate http://localhost:8000/query

That's it. Get accuracy scores and incorrect QA pairs instantly.

============================================================
✅ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================

❌ 15 Incorrect Pairs:

  1. Q: "What is RAG?"
     Score: 2/5 - Factually incorrect

  2. Q: "How does retrieval work?"
     Score: 3/5 - Incomplete answer

🚀 Quick Start

Install

pip install ragscore              # Core (works with Ollama)
pip install "ragscore[openai]"    # + OpenAI support
pip install "ragscore[notebook]"  # + Jupyter/Colab support
pip install "ragscore[all]"       # + All providers

Option 1: Python API (Notebook-Friendly)

Perfect for Jupyter, Colab, and rapid iteration. Get instant visualizations.

from ragscore import quick_test

# 1. Audit your RAG in one line
result = quick_test(
    endpoint="http://localhost:8000/query",  # Your RAG API
    docs="docs/",                            # Your documents
    n=10,                                    # Number of test questions
)

# 2. See the report
result.plot()

# 3. Inspect failures
bad_rows = result.df[result.df['score'] < 3]
display(bad_rows[['question', 'rag_answer', 'reason']])

Rich Object API:

  • result.accuracy - Accuracy score
  • result.df - Pandas DataFrame of all results
  • result.plot() - 3-panel visualization
  • result.corrections - List of items to fix

Option 2: CLI (Production)

Generate QA Pairs

# Set API key (or use local Ollama - no key needed!)
export OPENAI_API_KEY="sk-..."

# Generate from any document
ragscore generate paper.pdf
ragscore generate docs/*.pdf --concurrency 10

Evaluate Your RAG

# Point to your RAG endpoint
ragscore evaluate http://localhost:8000/query

# Custom options
ragscore evaluate http://api/ask --model gpt-4o --output results.json

🏠 100% Private with Local LLMs

# Use Ollama - no API keys, no cloud, 100% private
ollama pull llama3.1
ragscore generate confidential_docs/*.pdf
ragscore evaluate http://localhost:8000/query

Perfect for: Healthcare 🏥 • Legal ⚖️ • Finance 🏦 • Research 🔬

Ollama Model Recommendations

RAGScore generates complex structured QA pairs (question + answer + rationale + support span) in JSON format. This requires models with strong instruction-following and JSON output capabilities.

| Model | Size | Min RAM | QA Quality | Recommended |
|---|---|---|---|---|
| llama3.1:70b | 40GB | 48GB VRAM | Excellent | GPU server (A100, L40) |
| qwen2.5:32b | 18GB | 24GB VRAM | Excellent | GPU server (A10, L20) |
| llama3.1:8b | 4.7GB | 8GB VRAM | Good | Best local choice |
| qwen2.5:7b | 4.4GB | 8GB VRAM | Good | Good local alternative |
| mistral:7b | 4.1GB | 8GB VRAM | Good | Good local alternative |
| llama3.2:3b | 2.0GB | 4GB RAM | Fair | CPU-only / testing |
| qwen2.5:1.5b | 1.0GB | 2GB RAM | Poor | Not recommended |

Minimum recommended: 8B+ models. Smaller models (1.5B–3B) produce lower-quality support spans and may time out on longer chunks.
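Since weaker models often emit truncated or malformed JSON, it can help to sanity-check generated lines before trusting them. A minimal standard-library sketch (validate_qa is an illustrative helper, not part of RAGScore; the required fields follow the output schema shown below):

```python
import json

# Core fields every generated QA pair should carry, per the JSONL schema.
REQUIRED_KEYS = {"question", "answer", "rationale", "support_span"}

def validate_qa(line: str) -> bool:
    """Return True if a line is valid JSON containing the expected QA fields."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    return isinstance(record, dict) and REQUIRED_KEYS <= record.keys()

good = '{"question": "What is RAG?", "answer": "...", "rationale": "...", "support_span": "..."}'
bad = '{"question": "What is RAG?"'  # truncated JSON, as a weak model might emit
print(validate_qa(good), validate_qa(bad))  # True False
```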

Ollama Performance Guide

# Recommended: 8B model with concurrency 2 for local machines
ollama pull llama3.1:8b
ragscore generate docs/ --provider ollama --model llama3.1:8b

# GPU server (A10/L20): larger model with higher concurrency
ollama pull qwen2.5:32b
ragscore generate docs/ --provider ollama --model qwen2.5:32b --concurrency 5

Expected performance (28 chunks, 5 QA pairs per chunk):

| Hardware | Model | Time | Concurrency |
|---|---|---|---|
| MacBook (CPU) | llama3.2:3b | ~45 min | 2 |
| MacBook (CPU) | llama3.1:8b | ~25 min | 2 |
| A10 (24GB) | llama3.1:8b | ~3–5 min | 5 |
| L20/L40 (48GB) | qwen2.5:32b | ~3–5 min | 5 |
| OpenAI API | gpt-4o-mini | ~2 min | 10 |

RAGScore auto-reduces concurrency to 2 for local Ollama to avoid GPU/CPU contention.
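The defaulting behavior can be sketched as below. This is illustrative, not RAGScore's internal code; it assumes (consistent with the GPU-server example above, which passes --concurrency 5) that an explicit flag overrides the conservative local default:

```python
def effective_concurrency(provider, requested=None, cloud_default=10, local_default=2):
    """Sketch: local Ollama defaults to low concurrency to avoid GPU/CPU
    contention, unless the user passes --concurrency explicitly."""
    if requested is not None:   # explicit --concurrency always wins
        return requested
    return local_default if provider == "ollama" else cloud_default

print(effective_concurrency("ollama"))               # 2
print(effective_concurrency("ollama", requested=5))  # 5
print(effective_concurrency("openai"))               # 10
```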


🔌 Supported LLMs

| Provider | Setup | Notes |
|---|---|---|
| Ollama | ollama serve | Local, free, private |
| OpenAI | export OPENAI_API_KEY="sk-..." | Best quality |
| Anthropic | export ANTHROPIC_API_KEY="..." | Long context |
| DashScope | export DASHSCOPE_API_KEY="..." | Qwen models |
| vLLM | export LLM_BASE_URL="..." | Production-grade |
| Any OpenAI-compatible | export LLM_BASE_URL="..." | Groq, Together, etc. |

📊 Output Formats

Generated QA Pairs (output/generated_qas.jsonl)

{
  "id": "abc123",
  "question": "What is RAG?",
  "answer": "RAG (Retrieval-Augmented Generation) combines...",
  "rationale": "This is explicitly stated in the introduction...",
  "support_span": "RAG systems retrieve relevant documents...",
  "difficulty": "medium",
  "source_path": "docs/rag_intro.pdf"
}
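Because the output is JSONL (one JSON object per line), the standard library is enough to post-process it. A small sketch using abbreviated records in the shape shown above:

```python
import json

# Two abbreviated records in the generated_qas.jsonl shape.
sample_jsonl = "\n".join([
    '{"id": "a1", "question": "What is RAG?", "answer": "...", "difficulty": "medium", "source_path": "docs/rag_intro.pdf"}',
    '{"id": "a2", "question": "How does retrieval work?", "answer": "...", "difficulty": "hard", "source_path": "docs/rag_intro.pdf"}',
])

qas = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]
hard = [qa["question"] for qa in qas if qa["difficulty"] == "hard"]
print(len(qas), hard)  # 2 ['How does retrieval work?']
```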

Evaluation Results (--output results.json)

{
  "summary": {
    "total": 100,
    "correct": 85,
    "incorrect": 15,
    "accuracy": 0.85,
    "avg_score": 4.2
  },
  "incorrect_pairs": [
    {
      "question": "What is RAG?",
      "golden_answer": "RAG combines retrieval with generation...",
      "rag_answer": "RAG is a database system.",
      "score": 2,
      "reason": "Factually incorrect - RAG is not a database"
    }
  ]
}
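The results file is plain JSON, so triaging failures programmatically takes only a few lines. A sketch over an abbreviated copy of the sample above:

```python
import json

report = json.loads("""
{
  "summary": {"total": 100, "correct": 85, "incorrect": 15, "accuracy": 0.85, "avg_score": 4.2},
  "incorrect_pairs": [
    {"question": "What is RAG?", "score": 2, "reason": "Factually incorrect - RAG is not a database"}
  ]
}
""")

summary = report["summary"]
# Pull out the lowest-scoring pairs for manual review first.
worst = [p["question"] for p in report["incorrect_pairs"] if p["score"] <= 2]
print(summary["accuracy"], worst)  # 0.85 ['What is RAG?']
```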

🧪 Python API

from ragscore import run_pipeline, run_evaluation

# Generate QA pairs
run_pipeline(paths=["docs/"], concurrency=10)

# Evaluate RAG
results = run_evaluation(
    endpoint="http://localhost:8000/query",
    model="gpt-4o",  # LLM for judging
)
print(f"Accuracy: {results.accuracy:.1%}")

🤖 AI Agent Integration

RAGScore is designed for AI agents and automation:

# Structured CLI with predictable output
ragscore generate docs/ --concurrency 5
ragscore evaluate http://api/query --output results.json

# Exit codes: 0 = success, 1 = error
# JSON output for programmatic parsing
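An agent can drive the CLI via subprocess. A minimal sketch that builds the invocation from the flags documented above (evaluate_cmd is an illustrative helper; actually running it requires ragscore installed and a live RAG endpoint):

```python
from typing import List, Optional

def evaluate_cmd(endpoint: str, output: str = "results.json",
                 model: Optional[str] = None) -> List[str]:
    """Build a ragscore evaluate invocation from the documented flags."""
    cmd = ["ragscore", "evaluate", endpoint, "--output", output]
    if model:
        cmd += ["--model", model]
    return cmd

cmd = evaluate_cmd("http://localhost:8000/query", model="gpt-4o")
print(cmd)
# To actually run it:
# import subprocess
# ok = subprocess.run(cmd).returncode == 0   # exit code 0 = success
```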

CLI Reference:

| Command | Description |
|---|---|
| ragscore generate <paths> | Generate QA pairs from documents |
| ragscore evaluate <endpoint> | Evaluate RAG against golden QAs |
| ragscore --help | Show all commands and options |
| ragscore generate --help | Show generate options |
| ragscore evaluate --help | Show evaluate options |

⚙️ Configuration

Zero config required. Optional environment variables:

export RAGSCORE_CHUNK_SIZE=512          # Chunk size for documents
export RAGSCORE_QUESTIONS_PER_CHUNK=5   # QAs per chunk
export RAGSCORE_WORK_DIR=/path/to/dir   # Working directory
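For reference, this is how such variables are typically read, using the example values above as stand-in defaults (this is not RAGScore's internal code, and the real defaults may differ):

```python
import os

# Illustrative only: read each setting from the environment, falling back
# to the example values when the variable is unset.
chunk_size = int(os.environ.get("RAGSCORE_CHUNK_SIZE", "512"))
questions_per_chunk = int(os.environ.get("RAGSCORE_QUESTIONS_PER_CHUNK", "5"))
work_dir = os.environ.get("RAGSCORE_WORK_DIR", os.getcwd())
print(chunk_size, questions_per_chunk)  # 512 5 (when the variables are unset)
```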

🔐 Privacy & Security

| Data | Cloud LLM | Local LLM |
|---|---|---|
| Documents | ✅ Local | ✅ Local |
| Text chunks | ⚠️ Sent to LLM | ✅ Local |
| Generated QAs | ✅ Local | ✅ Local |
| Evaluation results | ✅ Local | ✅ Local |

Compliance: GDPR ✅ • HIPAA ✅ (with local LLMs) • SOC 2 ✅


🧪 Development

git clone https://github.com/HZYAI/RagScore.git
cd RagScore
pip install -e ".[dev,all]"
pytest

⭐ Star us on GitHub if RAGScore helps you!
Made with ❤️ for the RAG community
