Generate QA datasets & evaluate RAG systems in 2 commands
🔒 Privacy-First • ⚡ Lightning Fast • 🤖 Any LLM • 🏠 Local or Cloud
⚡ 2-Line RAG Evaluation
# Step 1: Generate QA pairs from your docs
ragscore generate docs/
# Step 2: Evaluate your RAG system
ragscore evaluate http://localhost:8000/query
That's it. Get accuracy scores and incorrect QA pairs instantly.
============================================================
✅ EXCELLENT: 85/100 correct (85.0%)
Average Score: 4.20/5.0
============================================================
❌ 15 Incorrect Pairs:
1. Q: "What is RAG?"
Score: 2/5 - Factually incorrect
2. Q: "How does retrieval work?"
Score: 3/5 - Incomplete answer
🚀 Quick Start
Install
pip install ragscore # Core (works with Ollama)
pip install "ragscore[openai]" # + OpenAI support
pip install "ragscore[notebook]" # + Jupyter/Colab support
pip install "ragscore[all]" # + All providers
Option 1: Python API (Notebook-Friendly)
Perfect for Jupyter, Colab, and rapid iteration. Get instant visualizations.
from ragscore import quick_test
# 1. Audit your RAG in one line
result = quick_test(
endpoint="http://localhost:8000/query", # Your RAG API
docs="docs/", # Your documents
n=10, # Number of test questions
)
# 2. See the report
result.plot()
# 3. Inspect failures
bad_rows = result.df[result.df['score'] < 3]
display(bad_rows[['question', 'rag_answer', 'reason']])
Rich Object API:
- result.accuracy - Accuracy score
- result.df - pandas DataFrame of all results
- result.plot() - 3-panel visualization
- result.corrections - List of items to fix
Option 2: CLI (Production)
Generate QA Pairs
# Set API key (or use local Ollama - no key needed!)
export OPENAI_API_KEY="sk-..."
# Generate from any document
ragscore generate paper.pdf
ragscore generate docs/*.pdf --concurrency 10
Evaluate Your RAG
# Point to your RAG endpoint
ragscore evaluate http://localhost:8000/query
# Custom options
ragscore evaluate http://api/ask --model gpt-4o --output results.json
🏠 100% Private with Local LLMs
# Use Ollama - no API keys, no cloud, 100% private
ollama pull llama3.1
ragscore generate confidential_docs/*.pdf
ragscore evaluate http://localhost:8000/query
Perfect for: Healthcare 🏥 • Legal ⚖️ • Finance 🏦 • Research 🔬
Ollama Model Recommendations
RAGScore generates complex structured QA pairs (question + answer + rationale + support span) in JSON format. This requires models with strong instruction-following and JSON output capabilities.
| Model | Size | Min Memory | QA Quality | Recommended For |
|---|---|---|---|---|
| llama3.1:70b | 40GB | 48GB VRAM | Excellent | GPU server (A100, L40) |
| qwen2.5:32b | 18GB | 24GB VRAM | Excellent | GPU server (A10, L20) |
| llama3.1:8b | 4.7GB | 8GB VRAM | Good | Best local choice |
| qwen2.5:7b | 4.4GB | 8GB VRAM | Good | Good local alternative |
| mistral:7b | 4.1GB | 8GB VRAM | Good | Good local alternative |
| llama3.2:3b | 2.0GB | 4GB RAM | Fair | CPU-only / testing |
| qwen2.5:1.5b | 1.0GB | 2GB RAM | Poor | Not recommended |
Minimum recommended: 8B+ models. Smaller models (1.5B–3B) produce lower-quality support spans and may time out on longer chunks.
Ollama Performance Guide
# Recommended for local machines: 8B model (concurrency is auto-reduced to 2 for local Ollama)
ollama pull llama3.1:8b
ragscore generate docs/ --provider ollama --model llama3.1:8b
# GPU server (A10/L20): larger model with higher concurrency
ollama pull qwen2.5:32b
ragscore generate docs/ --provider ollama --model qwen2.5:32b --concurrency 5
Expected performance (28 chunks, 5 QA pairs per chunk):
| Hardware | Model | Time | Concurrency |
|---|---|---|---|
| MacBook (CPU) | llama3.2:3b | ~45 min | 2 |
| MacBook (CPU) | llama3.1:8b | ~25 min | 2 |
| A10 (24GB) | llama3.1:8b | ~3–5 min | 5 |
| L20/L40 (48GB) | qwen2.5:32b | ~3–5 min | 5 |
| OpenAI API | gpt-4o-mini | ~2 min | 10 |
RAGScore auto-reduces concurrency to 2 for local Ollama to avoid GPU/CPU contention.
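To extrapolate the table above to a larger corpus, a rough back-of-the-envelope estimate may help. The sketch below assumes run time scales linearly with chunk count at fixed hardware, model, and concurrency, which the benchmark does not guarantee. `estimate_minutes` is a hypothetical helper for illustration, not part of RAGScore.

```python
def estimate_minutes(benchmark_minutes: float, target_chunks: int,
                     benchmark_chunks: int = 28) -> float:
    """Scale a measured run time linearly to a different number of chunks.

    Assumption: time grows linearly with chunk count (same hardware,
    model, and concurrency as the benchmark row being scaled).
    """
    if benchmark_chunks <= 0 or target_chunks < 0:
        raise ValueError("chunk counts must be positive")
    return benchmark_minutes * target_chunks / benchmark_chunks


# The benchmark run yields 28 chunks x 5 QA pairs per chunk.
qa_pairs = 28 * 5

# MacBook (CPU) with llama3.1:8b takes ~25 min for 28 chunks,
# so 280 chunks would take roughly ten times as long.
print(qa_pairs, estimate_minutes(25, 280))  # 140 250.0
```

Treat the result as an order-of-magnitude guide only; chunk length and model load time can shift real timings noticeably.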
🔌 Supported LLMs
| Provider | Setup | Notes |
|---|---|---|
| Ollama | ollama serve | Local, free, private |
| OpenAI | export OPENAI_API_KEY="sk-..." | Best quality |
| Anthropic | export ANTHROPIC_API_KEY="..." | Long context |
| DashScope | export DASHSCOPE_API_KEY="..." | Qwen models |
| vLLM | export LLM_BASE_URL="..." | Production-grade |
| Any OpenAI-compatible | export LLM_BASE_URL="..." | Groq, Together, etc. |
📊 Output Formats
Generated QA Pairs (output/generated_qas.jsonl)
{
"id": "abc123",
"question": "What is RAG?",
"answer": "RAG (Retrieval-Augmented Generation) combines...",
"rationale": "This is explicitly stated in the introduction...",
"support_span": "RAG systems retrieve relevant documents...",
"difficulty": "medium",
"source_path": "docs/rag_intro.pdf"
}
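A generated file can be checked before evaluation with a few lines of Python. This is a minimal sketch that assumes each line of the `.jsonl` file holds one complete JSON object with the fields shown above (the record above is pretty-printed for readability). `parse_qa_lines` and `difficulty_counts` are hypothetical helpers, not RAGScore functions.

```python
import json
from collections import Counter
from typing import Iterable

# Field names taken from the example record above.
REQUIRED_FIELDS = {
    "id", "question", "answer", "rationale",
    "support_span", "difficulty", "source_path",
}


def parse_qa_lines(lines: Iterable[str]) -> list:
    """Parse JSONL text (one JSON object per line), skipping blank lines."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no} is missing fields: {sorted(missing)}")
        records.append(record)
    return records


def difficulty_counts(records: list) -> Counter:
    """Count QA pairs per difficulty label."""
    return Counter(record["difficulty"] for record in records)


# Usage with a real file (path taken from the heading above):
# with open("output/generated_qas.jsonl", encoding="utf-8") as f:
#     print(difficulty_counts(parse_qa_lines(f)))
```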
Evaluation Results (--output results.json)
{
"summary": {
"total": 100,
"correct": 85,
"incorrect": 15,
"accuracy": 0.85,
"avg_score": 4.2
},
"incorrect_pairs": [
{
"question": "What is RAG?",
"golden_answer": "RAG combines retrieval with generation...",
"rag_answer": "RAG is a database system.",
"score": 2,
"reason": "Factually incorrect - RAG is not a database"
}
]
}
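The results file is plain JSON, so it can be post-processed without RAGScore itself. The sketch below relies only on the keys shown in the example above and lists the lowest-scoring pairs first. `summarize_results` is a hypothetical helper, and the inline `sample` stands in for a file loaded with `json.load`.

```python
import json


def summarize_results(results: dict) -> str:
    """Build a short text report from an evaluation results dict."""
    summary = results["summary"]
    lines = [
        f"{summary['correct']}/{summary['total']} correct "
        f"({summary['accuracy']:.1%}), "
        f"average score {summary['avg_score']:.2f}/5"
    ]
    # Lowest scores first, so the worst answers are reviewed first.
    for pair in sorted(results["incorrect_pairs"], key=lambda p: p["score"]):
        lines.append(f"[{pair['score']}/5] {pair['question']} -> {pair['reason']}")
    return "\n".join(lines)


# Stand-in for: results = json.load(open("results.json", encoding="utf-8"))
sample = json.loads("""
{
  "summary": {"total": 100, "correct": 85, "incorrect": 15,
              "accuracy": 0.85, "avg_score": 4.2},
  "incorrect_pairs": [
    {"question": "What is RAG?",
     "golden_answer": "RAG combines retrieval with generation...",
     "rag_answer": "RAG is a database system.",
     "score": 2,
     "reason": "Factually incorrect - RAG is not a database"}
  ]
}
""")
print(summarize_results(sample))
```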
🧪 Python API
from ragscore import run_pipeline, run_evaluation
# Generate QA pairs
run_pipeline(paths=["docs/"], concurrency=10)
# Evaluate RAG
results = run_evaluation(
endpoint="http://localhost:8000/query",
model="gpt-4o", # LLM for judging
)
print(f"Accuracy: {results.accuracy:.1%}")
🤖 AI Agent Integration
RAGScore is designed for AI agents and automation:
# Structured CLI with predictable output
ragscore generate docs/ --concurrency 5
ragscore evaluate http://api/query --output results.json
# Exit codes: 0 = success, 1 = error
# JSON output for programmatic parsing
CLI Reference:
| Command | Description |
|---|---|
| ragscore generate <paths> | Generate QA pairs from documents |
| ragscore evaluate <endpoint> | Evaluate RAG against golden QAs |
| ragscore --help | Show all commands and options |
| ragscore generate --help | Show generate options |
| ragscore evaluate --help | Show evaluate options |
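An agent or CI job could drive the CLI through `subprocess` and branch on the exit code described above. This is a sketch under the assumption that `ragscore` is on the `PATH`; it uses only the `evaluate` command and `--output` option shown in this README. `build_evaluate_command` and `run_cli` are hypothetical wrapper names.

```python
import subprocess
import sys
from typing import Optional, Sequence


def build_evaluate_command(endpoint: str, output: Optional[str] = None) -> list:
    """Assemble the `ragscore evaluate` argument list."""
    cmd = ["ragscore", "evaluate", endpoint]
    if output:
        cmd += ["--output", output]
    return cmd


def run_cli(cmd: Sequence[str]) -> bool:
    """Run a command and return True only when it exits with code 0."""
    try:
        completed = subprocess.run(cmd, capture_output=True, text=True)
    except FileNotFoundError:
        # The executable is not installed or not on PATH.
        print(f"command not found: {cmd[0]}", file=sys.stderr)
        return False
    if completed.returncode != 0:
        print(f"exit code {completed.returncode}: {completed.stderr.strip()}",
              file=sys.stderr)
    return completed.returncode == 0


# Example usage:
# ok = run_cli(build_evaluate_command("http://api/query", "results.json"))
```

Passing the arguments as a list avoids shell quoting issues when the endpoint URL contains characters such as `&` or `?`.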
⚙️ Configuration
Zero config required. Optional environment variables:
export RAGSCORE_CHUNK_SIZE=512 # Chunk size for documents
export RAGSCORE_QUESTIONS_PER_CHUNK=5 # QAs per chunk
export RAGSCORE_WORK_DIR=/path/to/dir # Working directory
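When RAGScore is launched from a script rather than a shell, the same variables can be set per run instead of globally. The sketch below builds an environment mapping to pass to `subprocess.run(..., env=...)`. `ragscore_env` is a hypothetical helper, and it assumes the CLI reads these variables at startup.

```python
import os
from typing import Optional


def ragscore_env(chunk_size: Optional[int] = None,
                 questions_per_chunk: Optional[int] = None,
                 work_dir: Optional[str] = None) -> dict:
    """Copy the current environment and apply optional RAGScore overrides."""
    env = dict(os.environ)
    overrides = {
        "RAGSCORE_CHUNK_SIZE": chunk_size,
        "RAGSCORE_QUESTIONS_PER_CHUNK": questions_per_chunk,
        "RAGSCORE_WORK_DIR": work_dir,
    }
    for name, value in overrides.items():
        if value is not None:
            # Environment values must be strings.
            env[name] = str(value)
    return env


# Example usage:
# import subprocess
# subprocess.run(["ragscore", "generate", "docs/"],
#                env=ragscore_env(chunk_size=256, questions_per_chunk=3))
```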
🔐 Privacy & Security
| Data | Cloud LLM | Local LLM |
|---|---|---|
| Documents | ✅ Local | ✅ Local |
| Text chunks | ⚠️ Sent to LLM | ✅ Local |
| Generated QAs | ✅ Local | ✅ Local |
| Evaluation results | ✅ Local | ✅ Local |
Compliance: GDPR ✅ • HIPAA ✅ (with local LLMs) • SOC 2 ✅
🧪 Development
git clone https://github.com/HZYAI/RagScore.git
cd RagScore
pip install -e ".[dev,all]"
pytest
🔗 Links
- GitHub • PyPI • Issues • Discussions
⭐ Star us on GitHub if RAGScore helps you!
Made with ❤️ for the RAG community