tokio-prompt-orchestrator

CI · Rust 1.85+ · License: MIT · unsafe forbidden · Tests

Multi-core, Tokio-native orchestration for LLM inference pipelines. Built with backpressure, circuit breakers, deduplication, and an autonomous self-improving control loop.

We ran 24 Claude Code agents simultaneously on a single RTX 4070. They wrote this codebase in one night (58,000+ lines of Rust, 1,491 tests, zero panics) while the orchestrator they were building managed the agents' own inference traffic. The dedup layer collapsed redundant context across agents into single inference calls, cutting total API consumption to 26% of expected usage across all 24 agents over the 90-minute run.

Anthropic's documented ceiling is 16 concurrent agents. We hit 24. The bottleneck was the API rate limit, not the orchestrator.



Windows .exe — No Install Required

Download orchestrator.exe from the releases page and double-click it. No Rust, no dependencies, nothing to install.

Step 1 — Get an API key (one-time, ~2 minutes)

You need a key from whichever AI provider you want to use:

Provider                 Where to get a key                            Starting cost
Anthropic (recommended)  https://console.anthropic.com/settings/keys   ~$5 credit
OpenAI                   https://platform.openai.com/api-keys          ~$5 credit
llama.cpp                No key needed — runs locally                  Free
echo                     No key needed — offline test mode             Free

Step 2 — Run the wizard

The first time you open orchestrator.exe it walks you through setup:

  ┌──────────────────────────────────────────────────┐
  │       Welcome to tokio-prompt-orchestrator        │
  │  Setup takes about 60 seconds.                    │
  └──────────────────────────────────────────────────┘

  Which AI provider do you want to use?

  1) Anthropic  (Claude — recommended for most users)
     Get a key: https://console.anthropic.com/settings/keys

  2) OpenAI     (GPT-4o and friends)
     Get a key: https://platform.openai.com/api-keys

  3) llama      (run models locally — no key needed)
  4) echo       (test mode — no key, no internet)

  Enter 1, 2, 3, or 4 [1]: 1

  Paste your key here (sk-ant-…): sk-ant-xxxxx
  ✓ Key saved.

  Model name (press Enter to use default) [claude-sonnet-4-6]: ↵
  Port for the web API [8080]: ↵

  ✓ All done! Settings saved to orchestrator.env next to this .exe.
    Run with --reset any time to change your key or provider.

Your key is stored only in orchestrator.env on your machine — never transmitted anywhere except to the provider you chose.

Step 3 — Start using it

After setup the banner appears and you can type prompts straight into the terminal:

╔══════════════════════════════════════════════════╗
║   tokio-prompt-orchestrator v0.1.0               ║
║  Provider : anthropic (claude-sonnet-4-6)        ║
║  Web API  : http://127.0.0.1:8080                ║
╠══════════════════════════════════════════════════╣
║  HOW TO USE                                      ║
║                                                  ║
║  Terminal  ─ type any question below             ║
║                                                  ║
║  Agents / IDEs ─ HTTP while this window is open: ║
║   POST http://127.0.0.1:8080/v1/prompt           ║
║   WS   ws://127.0.0.1:8080/v1/stream             ║
║                                                  ║
║  Claude Desktop ─ claude_desktop_config.json:    ║
║   mcpServers > orchestrator > url:               ║
║     "http://127.0.0.1:8080"                      ║
╚══════════════════════════════════════════════════╝

Ask me anything — try: What can you help me with?
Type 'exit' to quit.

> What can you help me with?

I can answer questions, write and review code, summarise documents,
help plan projects, and more — routed through a resilience pipeline
with automatic retries and deduplication.

>

The terminal REPL and the web API run at the same time — you type in the window while agents and IDEs connect to http://127.0.0.1:8080 in the background.

Connecting agents and IDEs

Claude Desktop — add to claude_desktop_config.json and restart Claude Desktop:

{
  "mcpServers": {
    "orchestrator": {
      "url": "http://127.0.0.1:8080"
    }
  }
}

VS Code / Cursor / Windsurf — point your AI extension's custom endpoint at http://127.0.0.1:8080/v1/prompt

curl:

curl -X POST http://127.0.0.1:8080/v1/prompt \
  -H "Content-Type: application/json" \
  -d '{"input": "What is the capital of France?"}'

Python / any script:

import requests
r = requests.post("http://127.0.0.1:8080/v1/prompt",
                  json={"input": "Summarise this codebase"})
print(r.json()["text"])

Changing your key or provider

Run orchestrator.exe --reset — it wipes the saved settings and runs the wizard again. Or edit orchestrator.env directly (it's a plain text file next to the .exe).


Library Usage

# Cargo.toml
[dependencies]
tokio-prompt-orchestrator = { git = "https://github.com/Mattbusel/tokio-prompt-orchestrator" }
tokio = { version = "1", features = ["full"] }

use tokio_prompt_orchestrator::{Pipeline, PipelineConfig, EchoWorker, InferenceRequest};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a pipeline with an echo worker.
    // Swap in OpenAIWorker, AnthropicWorker, or LlamaCppWorker for real inference.
    let config = PipelineConfig::default();
    let worker = EchoWorker::new();
    let pipeline = Pipeline::new(config, worker).await?;

    // Dedup, circuit breaker, and retry all happen transparently.
    let req = InferenceRequest::new("Explain backpressure in one sentence.");
    let response = pipeline.infer(req).await?;

    println!("{}", response.text);
    Ok(())
}

See examples/ for full working examples: OpenAI, Anthropic, llama.cpp, vLLM, REST, WebSocket, SSE streaming, and multi-worker setups.


Live Dashboard

TUI Dashboard

Real-time terminal dashboard: 985 requests in, 323 actual inferences (67.2% dedup collapse, $7.94 saved in 3 minutes). Circuit breakers across OpenAI, Anthropic, and llama.cpp all CLOSED. A latency spike at INFER p99=483ms caused the circuit to open; a probe was sent, and recovery was confirmed in 15 seconds. Built with ratatui.


What It Does

A five-stage async pipeline for LLM inference with a closed-loop self-improving control system. Bounded channels enforce backpressure end-to-end. Every request flows through:

RAG -> Assemble -> Inference -> Post-Process -> Stream

The self-improving stack runs alongside the pipeline as a live background service. It observes telemetry, detects anomalies, tunes parameters, and records outcomes without human intervention.
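The shed-on-full behavior of the bounded stages can be sketched with a std-only stand-in (the pipeline itself uses bounded tokio channels; `try_enqueue` is an illustrative helper, not part of the crate's API):

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Try to enqueue a request into a bounded stage queue.
/// Returns false (request shed) instead of blocking when the queue is full.
fn try_enqueue<T>(tx: &SyncSender<T>, item: T) -> bool {
    match tx.try_send(item) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) => false, // shed under backpressure
        Err(TrySendError::Disconnected(_)) => false,
    }
}

fn main() {
    // Capacity 2 stands in for the much larger per-stage bounds listed below.
    let (tx, _rx) = sync_channel::<u32>(2);
    assert!(try_enqueue(&tx, 1));
    assert!(try_enqueue(&tx, 2));
    // Third send exceeds capacity: shed, not blocked.
    assert!(!try_enqueue(&tx, 3));
    println!("shed on full: ok");
}
```

Shedding at enqueue time keeps latency bounded: a full queue rejects immediately rather than letting wait times grow without limit.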


Architecture

Pipeline -- Bounded async channels (RAG->ASM: 512, ASM->INF: 512, INF->PST: 1024, PST->STR: 256). Session affinity via deterministic sharding. Requests are shed rather than blocked when queues fill.
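Session affinity via deterministic sharding amounts to hashing the session id into a fixed shard count, so every request in a session lands on the same worker queue. A minimal sketch (hypothetical `shard_for` helper, not the crate's API):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a session id to a shard deterministically: same session, same shard.
fn shard_for(session_id: &str, num_shards: u64) -> u64 {
    let mut h = DefaultHasher::new();
    session_id.hash(&mut h);
    h.finish() % num_shards
}

fn main() {
    let a = shard_for("session-42", 8);
    // Repeated lookups for the same session always agree.
    assert_eq!(a, shard_for("session-42", 8));
    assert!(a < 8);
    println!("session-42 -> shard {a}");
}
```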

Resilience -- Circuit breaker (opens at 5 failures, closes at 80% success rate, 60s timeout). Retry with exponential backoff and jitter. Token-bucket rate limiting. Priority queues with 4 levels. In-memory and Redis caching with TTL.
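The circuit-breaker state machine with those thresholds (open at 5 consecutive failures, 60s timeout before probing, close when probes reach 80% success) can be sketched as follows; this is a simplified model, not the crate's implementation:

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum State { Closed, Open, HalfOpen }

struct CircuitBreaker {
    state: State,
    consecutive_failures: u32,
    opened_at: Option<Instant>,
    probe_successes: u32,
    probe_total: u32,
}

impl CircuitBreaker {
    fn new() -> Self {
        Self { state: State::Closed, consecutive_failures: 0,
               opened_at: None, probe_successes: 0, probe_total: 0 }
    }

    /// May a request pass? Open circuits admit probes after the 60s timeout.
    fn allow(&mut self, now: Instant) -> bool {
        if self.state == State::Open {
            if now.duration_since(self.opened_at.unwrap()) >= Duration::from_secs(60) {
                self.state = State::HalfOpen;
                self.probe_successes = 0;
                self.probe_total = 0;
            } else {
                return false;
            }
        }
        true
    }

    fn record(&mut self, success: bool, now: Instant) {
        match self.state {
            State::Closed => {
                if success { self.consecutive_failures = 0; }
                else {
                    self.consecutive_failures += 1;
                    if self.consecutive_failures >= 5 { // open at 5 failures
                        self.state = State::Open;
                        self.opened_at = Some(now);
                    }
                }
            }
            State::HalfOpen => {
                self.probe_total += 1;
                if success { self.probe_successes += 1; }
                if !success {
                    self.state = State::Open; // failed probe re-opens
                    self.opened_at = Some(now);
                } else if self.probe_total >= 5
                    && self.probe_successes as f64 / self.probe_total as f64 >= 0.8 {
                    self.state = State::Closed; // close at 80% probe success
                    self.consecutive_failures = 0;
                }
            }
            State::Open => {}
        }
    }
}

fn main() {
    let t0 = Instant::now();
    let mut cb = CircuitBreaker::new();
    for _ in 0..5 { cb.record(false, t0); }
    assert_eq!(cb.state, State::Open);
    assert!(!cb.allow(t0 + Duration::from_secs(1)));  // still open
    assert!(cb.allow(t0 + Duration::from_secs(61)));  // probe admitted
    println!("circuit opened after 5 failures: ok");
}
```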

Self-Improving Loop -- SelfImprovementLoop runs as a single background Tokio task:

TelemetryBus
    |  broadcast snapshot every 5s
    v
AnomalyDetector  ---- Z-score + CUSUM ---> TuningController
                                                |  PID adjusts 12 params
                                                v
MetaTaskGenerator <------------------- SnapshotStore
    |  triggered on Warning/Critical       records best configs
    v
ValidationGate  (cargo test + clippy)
    |
    v
AgentMemory  (outcome recorded, dead ends flagged)

Start it: cargo run --bin coordinator --features self-improving -- --self-improve
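The Z-score half of the anomaly detector reduces to a standard-deviations test against a recent telemetry window. A minimal sketch (`is_anomaly` is illustrative; the detector also runs CUSUM, not shown here):

```rust
/// Flag a new sample as anomalous when it sits more than `threshold`
/// standard deviations from the window mean (Z-score test).
fn is_anomaly(window: &[f64], sample: f64, threshold: f64) -> bool {
    let n = window.len() as f64;
    let mean = window.iter().sum::<f64>() / n;
    let var = window.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt();
    if std == 0.0 { return sample != mean; }
    ((sample - mean) / std).abs() > threshold
}

fn main() {
    // Steady p99 latencies around 40ms...
    let window = [38.0, 41.0, 40.0, 39.0, 42.0, 40.0];
    assert!(!is_anomaly(&window, 43.0, 3.0));
    // ...then a spike like the 483ms one shown on the dashboard.
    assert!(is_anomaly(&window, 483.0, 3.0));
    println!("spike detected");
}
```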

Intelligence Layer -- IntelligenceBridge wires six learned systems into the pipeline:

  • LearnedRouter -- epsilon-greedy multi-armed bandit, routes by observed quality
  • Autoscaler -- predictive, scales worker capacity before demand arrives
  • FeedbackCollector -- aggregates quality signals from post-processing
  • QualityEstimator -- scores response quality at inference time
  • PromptOptimizer -- rewrites prompts to minimize token spend
  • SemanticDedup -- embedding-based dedup that collapses semantically equivalent prompts before they reach the provider

These run as a closed loop: feedback scores update the router, the router changes load distribution, and the autoscaler reacts.
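The epsilon-greedy core of that loop can be sketched in a few lines: explore a random worker with probability epsilon, otherwise exploit the worker with the best observed quality, and fold feedback scores into a running mean. Everything below (`LearnedRouterSketch`, the tiny LCG used so the sketch needs no external crates) is illustrative, not the crate's API:

```rust
struct LearnedRouterSketch {
    quality: Vec<f64>, // running mean quality score per worker
    counts: Vec<u64>,
    epsilon: f64,
    rng_state: u64, // tiny deterministic LCG, stands in for a real RNG
}

impl LearnedRouterSketch {
    fn new(workers: usize, epsilon: f64) -> Self {
        Self { quality: vec![0.0; workers], counts: vec![0; workers],
               epsilon, rng_state: 0x9e3779b9 }
    }

    fn next_f64(&mut self) -> f64 {
        self.rng_state = self.rng_state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.rng_state >> 11) as f64 / (1u64 << 53) as f64
    }

    /// Pick a worker: explore with probability epsilon, else exploit.
    fn route(&mut self) -> usize {
        let n = self.quality.len();
        if self.next_f64() < self.epsilon {
            (self.next_f64() * n as f64) as usize % n
        } else {
            (0..n).max_by(|&a, &b| {
                self.quality[a].partial_cmp(&self.quality[b]).unwrap()
            }).unwrap()
        }
    }

    /// Feedback score updates the worker's running mean quality.
    fn record(&mut self, worker: usize, score: f64) {
        self.counts[worker] += 1;
        let n = self.counts[worker] as f64;
        self.quality[worker] += (score - self.quality[worker]) / n;
    }
}

fn main() {
    let mut router = LearnedRouterSketch::new(3, 0.1);
    // Feedback scores: worker 2 is consistently best.
    for _ in 0..50 {
        router.record(0, 0.4);
        router.record(1, 0.5);
        router.record(2, 0.9);
    }
    let best = (0..100).filter(|_| router.route() == 2).count();
    // Mostly exploits worker 2, with occasional exploration.
    assert!(best > 70);
    println!("worker 2 chosen {best}/100 times");
}
```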

HelixRouter Integration -- Three self_tune modules bridge the orchestrator to HelixRouter in real time. helix_probe polls /api/stats and converts queue depth, drop rate, and latency into a pressure signal. helix_config_pusher writes PID-derived adjustments back via PATCH /api/config. helix_feedback forwards quality scores from the intelligence layer as routing hints.

Capability Discovery -- CapabilityDiscovery runs cargo check --message-format=json to detect dead code and cargo audit --json to surface CVEs. Findings become tasks for the agent fleet.

Distributed -- NATS pub/sub for inter-node messaging. Redis-based cross-node dedup via atomic SET NX EX. Leader election with TTL renewal. Cluster manager with heartbeat tracking.

Coordination -- Agent fleet management. Atomic task claiming via filesystem locks. Priority ordering from TOML task files. Configurable health monitoring.
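Atomic task claiming via the filesystem boils down to exclusive file creation: `create_new` fails if the lock file already exists, so exactly one claimant wins even under concurrency. A minimal sketch (`claim_task` is a hypothetical helper):

```rust
use std::fs::OpenOptions;
use std::path::Path;

/// Claim a task by atomically creating its lock file. `create_new` makes
/// creation fail if another agent already holds the claim.
fn claim_task(lock_dir: &Path, task_id: &str) -> bool {
    let lock_path = lock_dir.join(format!("{task_id}.lock"));
    OpenOptions::new().write(true).create_new(true).open(&lock_path).is_ok()
}

fn main() {
    let dir = std::env::temp_dir().join("orchestrator-claim-demo");
    std::fs::create_dir_all(&dir).unwrap();
    let _ = std::fs::remove_file(dir.join("task-001.lock"));
    assert!(claim_task(&dir, "task-001"));  // first agent wins
    assert!(!claim_task(&dir, "task-001")); // second claim is rejected
    println!("task-001 claimed exactly once");
}
```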

Routing -- Complexity scoring routes simple prompts to local llama.cpp and complex ones to cloud APIs. Adaptive thresholds tune themselves on success/failure feedback. Per-model cost tracking with budget awareness.
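Complexity-based routing can be as simple as scoring a prompt on length and structural hints, then comparing against a threshold. This toy scorer (`complexity` and its weights are invented for illustration; the crate's real scoring and adaptive thresholds are more involved) shows the shape of the decision:

```rust
/// Toy complexity score: longer prompts with code fences or multi-step
/// phrasing score higher and route to a cloud model.
fn complexity(prompt: &str) -> f64 {
    let mut score = prompt.split_whitespace().count() as f64 / 100.0;
    if prompt.contains("```") { score += 0.5; }
    if prompt.contains("step by step") { score += 0.3; }
    score
}

fn route(prompt: &str, threshold: f64) -> &'static str {
    if complexity(prompt) < threshold { "llama.cpp (local)" } else { "cloud API" }
}

fn main() {
    assert_eq!(route("What is the capital of France?", 0.4), "llama.cpp (local)");
    assert_eq!(route("Refactor this module step by step ```fn main() {}```", 0.4),
               "cloud API");
    println!("routing ok");
}
```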

Observability -- Prometheus metrics, TUI terminal dashboard, web dashboard with SSE streaming, structured JSON tracing. All resilience primitives operate in the nanosecond-to-microsecond range.

MCP -- infer, pipeline_status, batch_infer, and configure_pipeline are all callable from Claude Desktop and Claude Code, with live stage latency reporting.


Quick Start

git clone https://github.com/Mattbusel/tokio-prompt-orchestrator.git
cd tokio-prompt-orchestrator

# Demo (no API keys needed)
cargo run --bin orchestrator-demo

# TUI dashboard
cargo run --bin tui --features tui

# Full self-improving stack
cargo run --bin tui --features self-improving,tui

# Coordinator with self-improvement loop
cargo run --bin coordinator --features self-improving -- --self-improve

# Web dashboard at http://localhost:3000
cargo run --bin dashboard --features dashboard

# MCP server for Claude Desktop / Claude Code
cargo run --bin mcp --features mcp -- --worker echo

Feature Flags

self-tune        # PID controllers, telemetry bus, anomaly detection
self-modify      # Agent loop: task generation, validation gate, memory
intelligence     # LearnedRouter, Autoscaler, FeedbackCollector, QualityEstimator
evolution        # Snapshot store, A/B experiments, rollback
self-improving   # Full autonomous optimization stack (all of the above)
tui              # Terminal dashboard (ratatui)
mcp              # Claude MCP server (rmcp 0.16)
dashboard        # Web dashboard with SSE streaming
distributed      # NATS pub/sub + Redis clustering
full             # web-api, metrics-server, caching, rate-limiting, tui, dashboard

Environment Variables

OPENAI_API_KEY="sk-..."               # OpenAI worker
ANTHROPIC_API_KEY="sk-ant-..."        # Anthropic worker
LLAMA_CPP_URL="http://localhost:8080" # llama.cpp server
REDIS_URL="redis://localhost:6379"    # Redis (caching + distributed)
NATS_URL="nats://localhost:4222"      # NATS (distributed clustering)
RUST_LOG="info"
LOG_FORMAT="json"                     # Structured logs for Datadog/Loki

Numbers

Lines of code       58,457
Tests               1,491 passing, 0 failing
Benchmarks          30+ criterion, all within budget
Dedup savings       67.2% collapse rate in live demo
Dedup check         ~1.5us p50
Circuit breaker     ~0.4us p50 (closed path)
Cache hit           81ns
Rate limit check    110ns
Channel send        <1us p99

Why This vs LangChain / LlamaIndex / CrewAI

                     tokio-prompt-orchestrator       LangChain / LlamaIndex   CrewAI
Language             Rust                            Python                   Python
Latency overhead     Nanoseconds                     Milliseconds             Milliseconds
Memory safety        Compile-time                    Runtime                  Runtime
Backpressure         Bounded channels, shed on full  Not built-in             Not built-in
Circuit breaker      Built-in                        Plugin/manual            Not built-in
Dedup collapse       67% in production               Not built-in             Not built-in
Self-improving loop  Live, autonomous                Not built-in             Not built-in
Concurrency          Tokio async, zero-copy          Python GIL               Python GIL
Binary size          4 MB (LTO)                      100s of MB deps          100s of MB deps

Use LangChain if you need to prototype quickly in Python. Use this if you need predictable latency, zero panics, and autonomous optimization at scale.


Why This Works

The inference cost problem is not solved at the model layer. It is solved at the orchestration layer. A production-grade Rust orchestrator can collapse 60-80% of LLM API spend through deduplication before any model optimization. The self-improving stack adds a control loop that tunes pipeline parameters continuously.
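The cheapest layer of that collapse is exact-match coalescing: identical in-flight prompts share one upstream call, and only the first caller pays. A minimal sketch of that layer alone (the crate also ships embedding-based SemanticDedup, which this does not model; `Dedup` here is illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Exact-match dedup: identical in-flight prompts collapse onto one
/// upstream inference call.
struct Dedup {
    in_flight: HashMap<u64, Vec<u64>>, // prompt hash -> waiting request ids
    upstream_calls: u64,
}

impl Dedup {
    fn new() -> Self {
        Self { in_flight: HashMap::new(), upstream_calls: 0 }
    }

    fn submit(&mut self, request_id: u64, prompt: &str) {
        let mut h = DefaultHasher::new();
        prompt.hash(&mut h);
        let waiters = self.in_flight.entry(h.finish()).or_default();
        if waiters.is_empty() { self.upstream_calls += 1; } // first caller pays
        waiters.push(request_id); // later callers share the same response
    }
}

fn main() {
    let mut dedup = Dedup::new();
    // 24 agents ask the same question, plus one unique request.
    for id in 0..24 { dedup.submit(id, "Summarise this codebase"); }
    dedup.submit(99, "What is the capital of France?");
    assert_eq!(dedup.upstream_calls, 2); // 25 requests, 2 provider calls
    println!("25 requests collapsed into {} calls", dedup.upstream_calls);
}
```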

SelfImprovementLoop is a live background service. IntelligenceBridge is a closed feedback loop between quality estimation and routing. CapabilityDiscovery executes cargo check and cargo audit against the running codebase. The system improves itself while serving inference.

The infrastructure here -- bounded pipelines, adaptive routing, multi-model failover, MCP-native tooling, agent fleet coordination, autonomous optimization -- is the foundation every serious LLM deployment will need.


Safety

This crate uses #![forbid(unsafe_code)]. No unsafe blocks exist in production code paths. Lints clippy::unwrap_used, clippy::expect_used, and clippy::panic are set to deny. This policy held across 24 concurrent agents building the codebase in one session without a single runtime panic.

Minimum Supported Rust Version

Rust 1.85 or later is required. The MSRV is tested in CI on every push. Changes to the MSRV are treated as semver minor bumps.

License

MIT


Ecosystem

This orchestrator is built on a set of standalone Rust primitives. Each is independently usable:

Crate               Role
tokio-agent-memory  Agent memory bus -- episodic recall for the self-improving loop
llm-budget          Hard cost enforcement across the agent fleet
llm-sync            CRDT state sync for distributed multi-node deployments
fin-stream          Lock-free ingestion pipeline -- same pattern as the inference queue

Related tools:

  • Every-Other-Token -- token-level LLM research tool whose telemetry feeds HelixRouter
  • agent-runtime -- unified Tokio agent runtime: orchestration, memory, knowledge graph, and ReAct loop
  • llm-cost-dashboard -- ratatui terminal dashboard for real-time LLM token spend
