
PromptThrift MCP

Enables 70-90% LLM API cost reduction by compressing conversation history via local Gemma 4 models or heuristics, featuring token counting, model routing, and pinned facts for preserving critical context.

Listed on Glama · Updated Apr 5, 2026

PromptThrift MCP — Smart Token Compression for LLM Apps

Cut 70-90% of your LLM API costs with intelligent conversation compression. Now with Gemma 4 local compression — smarter summaries, zero API cost.

License: MIT Python 3.10+ MCP Compatible Gemma 4

The Problem

Every LLM API call resends your entire conversation history. A 20-turn chat costs 6x more per call than a 3-turn one — you're paying for the same old messages over and over.

Turn 1:  ████ 700 tokens ($0.002)
Turn 5:  ████████████████ 4,300 tokens ($0.013)
Turn 20: ████████████████████████████████████████ 12,500 tokens ($0.038)
                                              ↑ You're paying for THIS every call
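The growth above is easy to reproduce with back-of-envelope arithmetic. The token counts come from the diagram; the ~$3/MTok input price is an illustrative rate, not tied to any specific provider:

```python
# Illustrative arithmetic behind the diagram above: token counts from
# the bars, price is a generic ~$3/MTok input rate.
PRICE_PER_MTOK = 3.00  # USD per million input tokens (illustrative)

def call_cost(tokens: int) -> float:
    """Input-token cost of a single API call, in USD."""
    return tokens * PRICE_PER_MTOK / 1_000_000

for turn, tokens in [(1, 700), (5, 4_300), (20, 12_500)]:
    print(f"Turn {turn:>2}: {tokens:>6,} tokens -> ${call_cost(tokens):.4f}")
```

Because history accumulates linearly, the per-call cost keeps climbing even when each new message is small.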

The Solution

PromptThrift is an MCP server with 4 tools to slash your API costs:

| Tool | What it does | Impact |
|---|---|---|
| `promptthrift_compress_history` | Compress old turns into a smart summary | 50-90% fewer input tokens |
| `promptthrift_count_tokens` | Track token usage & costs across 14 models | Know where the money goes |
| `promptthrift_suggest_model` | Recommend the cheapest model for the task | 60-80% savings on simple tasks |
| `promptthrift_pin_facts` | Pin critical facts that survive compression | Never lose key context |

Why PromptThrift?

| | PromptThrift | Context Mode | Headroom |
|---|---|---|---|
| License | MIT (commercial OK) | ELv2 (no competing) | Apache 2.0 |
| Compression type | Conversation memory | Tool schema virtualization | Tool output |
| Local LLM support | Gemma 4 via Ollama | No | No |
| Cost tracking | Multi-model comparison | No | No |
| Model routing | Built-in | No | No |
| Pinned facts | Never-Compress List | No | No |

Quick Start

Install

git clone https://github.com/woling-dev/promptthrift-mcp.git
cd promptthrift-mcp
pip install -r requirements.txt

Optional: Enable Gemma 4 Compression

For smarter AI-powered compression (free, runs locally):

# Install Ollama: https://ollama.com
ollama pull gemma4:4b

PromptThrift auto-detects Ollama: if it's running, Gemma 4 handles compression; if not, PromptThrift falls back to fast heuristic compression. Zero config needed.
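The detection logic can be sketched roughly as follows. This is a hypothetical illustration, not the server's actual internals; only the Ollama base URL (its standard default port) comes from the docs, and the compression functions here are trivial stand-ins:

```python
# Hypothetical sketch of the Ollama auto-detection described above.
# Function names and bodies are illustrative stand-ins.
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"

def ollama_available(url: str = OLLAMA_URL, timeout: float = 1.0) -> bool:
    """Probe the local Ollama endpoint; False if it is not reachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

def heuristic_compress(turns: list[str]) -> str:
    # Stand-in for the rule-based fallback: just join the turns.
    return " | ".join(turns)

def llm_compress(turns: list[str]) -> str:
    # Stand-in: the real path would call Ollama's generate API
    # with the configured Gemma model.
    return heuristic_compress(turns)

def compress(turns: list[str]) -> str:
    """Use Gemma 4 via Ollama when available, heuristics otherwise."""
    if ollama_available():
        return llm_compress(turns)
    return heuristic_compress(turns)
```

The key design point is that the probe is cheap and happens per compression call, so the server degrades gracefully if Ollama is stopped mid-session.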

Claude Desktop

Add to claude_desktop_config.json:

{
  "mcpServers": {
    "promptthrift": {
      "command": "python",
      "args": ["/path/to/promptthrift-mcp/server.py"]
    }
  }
}

Cursor / Windsurf

Add to your MCP settings:

{
  "mcpServers": {
    "promptthrift": {
      "command": "python",
      "args": ["/path/to/promptthrift-mcp/server.py"]
    }
  }
}

Real-World Example

A customer service bot handling olive oil product Q&A:

Before compression (sent every API call):

Q: Can I drink olive oil straight?
A: Yes! Our extra virgin is drinkable. We have 500ml and 1000ml.
Q: What's the difference between PET and glass bottles?
A: Glass is our premium line. 1000ml PET is for heavy cooking families.
Q: Which one do you recommend?
A: For drinking: Extra Virgin 500ml. For salads/cooking: 1000ml.
Q: I also do a lot of frying.
A: For high-heat frying, our Pure Olive Oil 500ml (230°C smoke point).

~250 tokens resent with every subsequent API call

After Gemma 4 compression:

[Compressed history]
Customer asks about olive oil products. Key facts:
- Extra virgin (500ml glass) for drinking, single-origin available
- 1000ml PET for cooking/salads (lower grade, family-size)
- Pure olive oil 500ml for high-heat frying (230°C smoke point)
[End compressed history]

~80 tokens — 68% saved on every call after this point

With 100 customers/day averaging 30 turns each on Claude Sonnet: ~$14/month saved from one bot.

Pinned Facts (Never-Compress List)

Some facts must never be lost during compression — user names, critical preferences, key decisions. Pin them:

You: "Pin the fact that this customer is allergic to nuts"

→ promptthrift_pin_facts(action="add", facts=["Customer is allergic to nuts"])
→ This fact will appear in ALL future compressed summaries, guaranteed.
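The guarantee works because pinned facts live outside the compressible history and get prepended to every summary. A minimal sketch of that idea (names are hypothetical, not the server's actual internals):

```python
# Illustrative sketch of the pinned-facts guarantee: facts are stored
# separately from history and injected into every compressed summary.
pinned: list[str] = []

def pin_facts(action: str, facts: list[str]) -> list[str]:
    """Add or remove never-compress facts; returns the current list."""
    if action == "add":
        pinned.extend(f for f in facts if f not in pinned)
    elif action == "remove":
        for f in facts:
            if f in pinned:
                pinned.remove(f)
    return pinned

def build_summary(compressed: str) -> str:
    """Prepend all pinned facts to a compressed-history summary."""
    header = "".join(f"[PINNED] {f}\n" for f in pinned)
    return header + compressed

pin_facts("add", ["Customer is allergic to nuts"])
print(build_summary("Customer asked about olive oil products."))
```

Since the facts are injected after compression runs, even an aggressive summarizer cannot drop them.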

Supported Models (April 2026 pricing)

| Model | Input $/MTok | Output $/MTok | Local? |
|---|---|---|---|
| gemma-4-e2b | $0.00 | $0.00 | Ollama |
| gemma-4-e4b | $0.00 | $0.00 | Ollama |
| gemma-4-27b | $0.00 | $0.00 | Ollama |
| gemini-2.0-flash | $0.10 | $0.40 | — |
| gpt-4.1-nano | $0.10 | $0.40 | — |
| gpt-4o-mini | $0.15 | $0.60 | — |
| gemini-2.5-flash | $0.15 | $0.60 | — |
| gpt-4.1-mini | $0.40 | $1.60 | — |
| claude-haiku-4.5 | $1.00 | $5.00 | — |
| gemini-2.5-pro | $1.25 | $10.00 | — |
| gpt-4.1 | $2.00 | $8.00 | — |
| gpt-4o | $2.50 | $10.00 | — |
| claude-sonnet-4.6 | $3.00 | $15.00 | — |
| claude-opus-4.6 | $5.00 | $25.00 | — |
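Using a few rates from the table above, here is how per-call cost comparisons work (the kind of math behind `promptthrift_suggest_model`). The token counts are arbitrary example values:

```python
# Per-call cost under a few of the rates listed above, to show why
# routing simple tasks to cheaper models pays off.
PRICES = {  # model: (input $/MTok, output $/MTok)
    "gpt-4o-mini": (0.15, 0.60),
    "claude-sonnet-4.6": (3.00, 15.00),
    "claude-opus-4.6": (5.00, 25.00),
}

def per_call_cost(model: str, in_tok: int, out_tok: int) -> float:
    """Total USD cost of one call with the given token counts."""
    p_in, p_out = PRICES[model]
    return (in_tok * p_in + out_tok * p_out) / 1_000_000

# Example: a call with 5,000 input and 500 output tokens.
for model in PRICES:
    print(f"{model}: ${per_call_cost(model, 5_000, 500):.4f}")
```

For this workload the cheapest model in the table is roughly 20x cheaper than the most expensive one, which is where the "60-80% on simple tasks" figure comes from when routing is applied selectively.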

How It Works

Before (every API call sends ALL of this):
┌──────────────────────────────────┐
│ System prompt      (500 tokens)  │
│ Turn 1: user+asst  (600 tokens)  │  ← Repeated every call
│ Turn 2: user+asst  (600 tokens)  │  ← Repeated every call
│ ...                              │
│ Turn 8: user+asst  (600 tokens)  │  ← Repeated every call
│ Turn 9: user+asst  (new)         │
│ Turn 10: user      (new)         │
└──────────────────────────────────┘
Total: ~6,500 tokens per call

After PromptThrift compression:
┌──────────────────────────────────┐
│ System prompt      (500 tokens)  │
│ [Pinned facts]      (50 tokens)  │  ← Always preserved
│ [Compressed summary](200 tokens) │  ← Turns 1-8 in 200 tokens!
│ Turn 9: user+asst  (kept)        │
│ Turn 10: user      (kept)        │
└──────────────────────────────────┘
Total: ~1,750 tokens per call (73% saved!)
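The before/after totals in the diagram check out as simple arithmetic:

```python
# Savings implied by the diagram above: turns 1-8 collapse into a
# 200-token summary plus 50 tokens of pinned facts.
before, after = 6_500, 1_750  # token totals from the diagram
saved = (before - after) / before
print(f"{saved:.0%} saved per call")
```

Note the savings apply to every subsequent call, not just one, so they compound over the life of the conversation.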

Compression Modes

| Mode | Method | Quality | Speed | Cost |
|---|---|---|---|---|
| Heuristic | Rule-based extraction | Good (50-60% reduction) | Instant | Free |
| LLM (Gemma 4) | AI-powered understanding | Excellent (70-90% reduction) | ~2s | Free (local) |

PromptThrift automatically uses the best available method. Install Ollama + Gemma 4 for maximum compression quality.
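To make "rule-based extraction" concrete, here is a deliberately crude stand-in: keep turns that carry hard facts (numbers, temperatures, prices) and drop conversational filler. The real server's rules are certainly more sophisticated; this only illustrates the idea:

```python
# Minimal stand-in for the heuristic mode: keep fact-bearing turns,
# drop filler. Not the server's actual rules.
import re

FACT_PATTERN = re.compile(r"\d|°C|\$")  # crude signal for concrete facts

def extract_facts(turns: list[str]) -> str:
    """Keep only turns that contain digits, temperatures, or prices."""
    kept = [t for t in turns if FACT_PATTERN.search(t)]
    return "[Compressed history]\n" + "\n".join(f"- {t}" for t in kept)

summary = extract_facts([
    "Yes! Our extra virgin is drinkable.",
    "We have 500ml and 1000ml bottles.",
    "Glass is our premium line.",
    "Pure Olive Oil has a 230°C smoke point.",
])
print(summary)
```

Heuristics like this are instant and free but lossy, which is why the Gemma 4 path produces better summaries when available.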

Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `PROMPTTHRIFT_OLLAMA_MODEL` | No | `gemma4:4b` | Ollama model for LLM compression |
| `PROMPTTHRIFT_OLLAMA_URL` | No | `http://localhost:11434` | Ollama API endpoint |
| `PROMPTTHRIFT_DEFAULT_MODEL` | No | `claude-sonnet-4.6` | Default model for cost estimates |
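Since none of the variables are required, the server presumably reads them with the defaults above, along these lines (a sketch, not the server's exact code):

```python
# How the documented variables would typically be read, with the
# defaults from the table above.
import os

OLLAMA_MODEL = os.environ.get("PROMPTTHRIFT_OLLAMA_MODEL", "gemma4:4b")
OLLAMA_URL = os.environ.get("PROMPTTHRIFT_OLLAMA_URL", "http://localhost:11434")
DEFAULT_MODEL = os.environ.get("PROMPTTHRIFT_DEFAULT_MODEL", "claude-sonnet-4.6")
```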

Security

  • All data processed locally by default — nothing leaves your machine
  • Ollama compression runs 100% on your hardware
  • Post-compression sanitizer strips prompt injection patterns from summaries
  • API keys read from environment variables only, never hardcoded
  • No persistent storage, no telemetry, no third-party calls

Roadmap

Shipped:

  • Heuristic conversation compression
  • Multi-model token counting (14 models)
  • Intelligent model routing
  • Gemma 4 local LLM compression via Ollama
  • Pinned facts (Never-Compress List)
  • Post-compression security sanitizer

Planned:

  • Cloud-based compression (Anthropic/OpenAI API fallback)
  • Prompt caching optimization advisor
  • Web dashboard for usage analytics
  • VS Code extension

Contributing

PRs welcome! This project is MIT-licensed: fork it, improve it, ship it.

License

MIT License — Free for personal and commercial use.


Built by Woling Dev Lab

Star this repo if it saves you money!
