# k8s-gpu-mcp-server

**Just-in-Time SRE Diagnostic Agent for NVIDIA GPU Clusters on Kubernetes**

## Overview
k8s-gpu-mcp-server is an ephemeral diagnostic agent that provides surgical,
real-time NVIDIA GPU hardware introspection for Kubernetes clusters via the
Model Context Protocol (MCP).
Unlike traditional monitoring systems, this agent is designed for AI-assisted troubleshooting by SREs debugging complex hardware failures that standard Kubernetes APIs cannot detect.
## ✨ Key Features

- 🎯 On-Demand Diagnostics - Agent runs only during `kubectl exec` sessions
- 🔌 Stdio Transport - JSON-RPC 2.0 over `kubectl debug` SPDY tunneling
- 🔍 Deep Hardware Access - Direct NVML integration for GPU diagnostics
- 🤖 AI-Native - Built for Claude Desktop, Cursor, and MCP-compatible hosts
- 🔒 Secure by Default - Read-only operations with explicit operator mode
- ⚡ Production Ready - Tested on real Tesla T4 hardware, 74/74 tests passing
## 🚀 Quick Start

### One-Click Install

Use the one-click install button on the project page to add the server to Cursor automatically.
### One-Line Installation

```bash
# Using npx (recommended)
npx k8s-gpu-mcp-server@latest

# Or install globally
npm install -g k8s-gpu-mcp-server
```
### 📋 Manual Configuration: Cursor / VS Code

Add to `~/.cursor/mcp.json` (Cursor) or your VS Code MCP config:

```json
{
  "mcpServers": {
    "k8s-gpu-mcp": {
      "command": "npx",
      "args": ["-y", "k8s-gpu-mcp-server@latest"]
    }
  }
}
```
### 📋 Manual Configuration: Claude Desktop

- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`

```json
{
  "mcpServers": {
    "k8s-gpu-mcp": {
      "command": "npx",
      "args": ["-y", "k8s-gpu-mcp-server@latest"]
    }
  }
}
```
### Install from Source

```bash
# Clone and build
git clone https://github.com/ArangoGutierrez/k8s-gpu-mcp-server.git
cd k8s-gpu-mcp-server
make agent

# Test with mock GPUs (no hardware required)
cat examples/gpu_inventory.json | ./bin/agent --nvml-mode=mock

# Test with a real GPU (requires the NVIDIA driver)
cat examples/gpu_inventory.json | ./bin/agent --nvml-mode=real
```
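You can also hand-craft a request instead of using the example file. Below is a minimal smoke test, assuming the agent implements the standard MCP `initialize` handshake over stdio (the `clientInfo` values are placeholders):

```bash
# Hedged smoke test: send one MCP initialize request to the mock agent
echo '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18","capabilities":{},"clientInfo":{"name":"smoke-test","version":"0.0.0"}}}' \
  | ./bin/agent --nvml-mode=mock
```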
### Deploy to Kubernetes

```bash
# Deploy with Helm (RuntimeClass mode - recommended)
helm install k8s-gpu-mcp-server ./deployment/helm/k8s-gpu-mcp-server \
  --namespace gpu-diagnostics --create-namespace

# Find the agent pod on the target node
NODE_NAME=<node-name>
POD=$(kubectl get pods -n gpu-diagnostics \
  -l app.kubernetes.io/name=k8s-gpu-mcp-server \
  --field-selector spec.nodeName=$NODE_NAME \
  -o jsonpath='{.items[0].metadata.name}')

# Start a diagnostic session
kubectl exec -it -n gpu-diagnostics $POD -- /agent --mode=read-only
```
> **Note:** GPU access requires `runtimeClassName: nvidia`, configured by the GPU Operator or `nvidia-ctk`. For clusters without a RuntimeClass, use the fallback: `--set gpu.runtimeClass.enabled=false --set gpu.resourceRequest.enabled=true`
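For scripted one-shot diagnostics, JSON-RPC can be piped straight through `kubectl exec`. This sketch assumes the agent follows the standard MCP lifecycle (`initialize`, the `initialized` notification, then `tools/call`); the tool name comes from the tools table below:

```bash
# Hedged sketch: one-shot get_gpu_inventory call through kubectl exec
printf '%s\n%s\n%s\n' \
  '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18","capabilities":{},"clientInfo":{"name":"cli","version":"0.0.0"}}}' \
  '{"jsonrpc":"2.0","method":"notifications/initialized"}' \
  '{"jsonrpc":"2.0","id":2,"method":"tools/call","params":{"name":"get_gpu_inventory","arguments":{}}}' \
  | kubectl exec -i -n gpu-diagnostics $POD -- /agent --mode=read-only
```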
### Configure Claude Desktop with kubectl (Advanced)

For deployed agents, add this to your Claude Desktop configuration:

```json
{
  "mcpServers": {
    "k8s-gpu-agent": {
      "command": "kubectl",
      "args": ["exec", "-i", "deploy/k8s-gpu-mcp-server", "-n", "gpu-diagnostics", "--", "/agent"]
    }
  }
}
```

Then ask Claude: "What's the temperature of the GPUs?"
## 📊 Architecture

```
┌──────────────┐      kubectl debug      ┌────────────────┐
│    Claude    │ ──────────────────────> │    K8s Node    │
│   Desktop    │    SPDY Stdio Tunnel    │  ┌──────────┐  │
└──────────────┘                         │  │  Agent   │  │
        ▲                                │  │ (stdio)  │  │
        │  JSON-RPC 2.0                  │  └────┬─────┘  │
        │  MCP Protocol                  │       │        │
        └────────────────────────────────│   ┌───▼────┐   │
                                         │   │  NVML  │   │
                                         │   │  API   │   │
                                         │   └───┬────┘   │
                                         │       │        │
                                         │   GPU 0...N    │
                                         └────────────────┘
```
**Design Principles:**

- "Syringe Pattern": Ephemeral injection, zero idle footprint
- Stdio-Only: No network listeners, firewall-friendly
- Interface Abstraction: Testable, flexible, portable (see the sketch below)
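To make the third principle concrete, here is a hedged Go sketch (hypothetical type and method names, not the project's actual API) of how tool handlers can depend on a narrow provider interface so the same code runs against real NVML or an in-memory mock:

```go
package main

import "fmt"

// GPU is the minimal device view the diagnostic tools need.
type GPU struct {
	Index       int
	Name        string
	TempCelsius int
}

// Provider abstracts the hardware backend: real NVML via CGO, or a mock.
type Provider interface {
	Inventory() ([]GPU, error)
}

// mockProvider satisfies Provider with no GPU or driver present.
type mockProvider struct{}

func (mockProvider) Inventory() ([]GPU, error) {
	return []GPU{{Index: 0, Name: "Mock-T4", TempCelsius: 29}}, nil
}

func main() {
	var p Provider = mockProvider{} // a real NVML-backed Provider would be swapped in for --nvml-mode=real
	gpus, err := p.Inventory()
	if err != nil {
		panic(err)
	}
	for _, g := range gpus {
		fmt.Printf("GPU %d: %s (%d°C)\n", g.Index, g.Name, g.TempCelsius)
	}
}
```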
📖 Architecture Documentation →
## 🛠️ Available Tools

| Tool | Description | Status |
|---|---|---|
| `get_gpu_inventory` | Hardware inventory + telemetry | ✅ Available |
| `analyze_xid_errors` | Parse GPU XID error codes from kernel logs | ✅ Available |
| `get_gpu_health` | GPU health monitoring with scoring | ✅ Available |
| `get_gpu_telemetry` | Real-time metrics | 🚧 M2 Phase 3 |
| `inspect_topology` | NVLink/PCIe topology | 🚧 M2 Phase 4 |
| `kill_gpu_process` | Terminate GPU process | 🚧 M3 (Operator) |
| `reset_gpu` | GPU reset | 🚧 M3 (Operator) |
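For reference, a `tools/call` round trip looks roughly like this. The JSON-RPC envelope follows the MCP 2025-06-18 specification the project targets; the result text is illustrative only:

```json
{"jsonrpc": "2.0", "id": 3, "method": "tools/call",
 "params": {"name": "get_gpu_health", "arguments": {}}}
```

A successful response carries the tool output as MCP content:

```json
{"jsonrpc": "2.0", "id": 3,
 "result": {"content": [{"type": "text", "text": "GPU 0: healthy"}]}}
```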
## 📈 Project Status

- Current Milestone: M2: Hardware Introspection
- Due: January 17, 2026
- Progress: Phase 1 Complete (Real NVML ✅)
### Completed Milestones

- ✅ M1: Foundation & API - Completed Jan 3, 2026
  - Go module scaffolding
  - MCP stdio server
  - Mock NVML implementation
  - Comprehensive CI/CD
### Recent Updates

- Jan 4, 2026: GPU health monitoring tool (`get_gpu_health`) merged
- Jan 3, 2026: XID error analysis tool (`analyze_xid_errors`) merged
- Jan 3, 2026: Real NVML integration complete, tested on Tesla T4
- Jan 3, 2026: 74/74 tests passing, 5/5 integration tests on a real GPU
## 🧪 Testing

### Unit Tests (No GPU Required)

```bash
make test            # Run all unit tests (74/74 passing)
make coverage        # Generate coverage report
make coverage-html   # View coverage in browser
```
### Integration Tests (Requires GPU)

```bash
make test-integration   # Run on GPU hardware

# Or manually:
go test -tags=integration -v ./pkg/nvml/
```
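The `-tags=integration` flag works because the integration files are gated behind a Go build tag, so they never compile in GPU-less CI. A minimal sketch of that gating (the test body is hypothetical; only the build-tag mechanics are the point):

```go
//go:build integration

package nvml_test

import "testing"

// Compiled only when `go test -tags=integration` is passed; a real test
// would initialize NVML here and query the first device.
func TestIntegrationGating(t *testing.T) {
	t.Log("integration build tag active")
}
```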
**Latest Test Results on Tesla T4:**

```
✓ TestRealNVML_Integration
  - GPU: Tesla T4 (15GB)
  - Temperature: 29°C
  - Power: 13.9W
  - Utilization: 0% (idle)
✓ 5/5 integration tests passing
✓ 74/74 total tests passing
```
## 🏗️ Build

```bash
# Build for the local platform
make agent

# Build for Linux (with real NVML)
CGO_ENABLED=1 GOOS=linux GOARCH=amd64 make agent

# Build the container image
make image

# Multi-arch release builds
make dist
```

**Binary Sizes:**

- Mock mode: 4.3MB (CGO disabled)
- Real mode: 7.9MB (CGO enabled)
## 📦 Installation

### Using npm (Recommended)

```bash
# Run directly with npx
npx k8s-gpu-mcp-server@latest

# Or install globally
npm install -g k8s-gpu-mcp-server
```

### From Source

```bash
git clone https://github.com/ArangoGutierrez/k8s-gpu-mcp-server.git
cd k8s-gpu-mcp-server
make agent
sudo mv bin/agent /usr/local/bin/k8s-gpu-mcp-server
```

### Using Go

```bash
go install github.com/ArangoGutierrez/k8s-gpu-mcp-server/cmd/agent@latest
```

### Container Image (Coming in M3)

```bash
docker pull ghcr.io/arangogutierrez/k8s-gpu-mcp-server:latest
```
## 🤝 Contributing

We welcome contributions! Please see our Development Guide for details.

### Quick Contribution Guide

- Check open issues
- Fork and create a feature branch: `git checkout -b feat/my-feature`
- Make changes, add tests
- Run checks: `make all`
- Commit with DCO: `git commit -s -S -m "feat(scope): description"`
- Open a PR with labels and milestone
## 📚 Documentation

- Quick Start Guide - Get running in 5 minutes
- Architecture - System design and components
- MCP Usage - How to consume the MCP server
- Development Guide - Contributing guidelines
- Examples - Sample JSON-RPC requests
## 🔧 Technology Stack

- Language: Go 1.25+ (latest stable)
- MCP Protocol: mcp-go v0.43.2
- GPU Library: go-nvml v0.13.0-1
- Testing: testify v1.10.0
- Container: Distroless Debian 12 (coming in M3)
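Expressed as a `go.mod` sketch: the versions come from the list above, while the `github.com/mark3labs/mcp-go` and `github.com/NVIDIA/go-nvml` module paths are assumptions based on the library names:

```
// go.mod (sketch)
module github.com/ArangoGutierrez/k8s-gpu-mcp-server

go 1.25

require (
	github.com/NVIDIA/go-nvml v0.13.0-1
	github.com/mark3labs/mcp-go v0.43.2
	github.com/stretchr/testify v1.10.0
)
```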
## 🎯 Use Cases

### 1. Debugging Stuck Training Jobs

```
SRE: "Why is the training job on node-5 stuck?"
Claude → k8s-gpu-mcp-server → Detects XID 48 (ECC Error)
Claude: "Node-5 has uncorrectable memory errors. Drain immediately."
```
### 2. Thermal Management

```
SRE: "Are any GPUs thermal throttling?"
Claude → k8s-gpu-mcp-server → Checks temps and throttle status
Claude: "GPU 3 is at 86°C and thermal throttling. Check cooling."
```
### 3. Topology Validation

```
SRE: "Is NVLink properly configured for multi-GPU training?"
Claude → k8s-gpu-mcp-server → Inspects NVLink topology
Claude: "All 8 GPUs connected via NVLink, 600GB/s bandwidth."
```
### 4. Zombie Process Hunting

```
SRE: "GPU memory is full but no pods are running"
Claude → k8s-gpu-mcp-server → Lists GPU processes
Claude: "Found zombie process PID 12345 using 8GB. Kill it?"
```
## 🏆 Achievements

- ✅ Go 1.25 - Latest Go version
- ✅ Real NVML - Tested on Tesla T4
- ✅ 74/74 Tests - 100% passing with race detector
- ✅ Zero Lint Issues - Clean codebase
- ✅ 7.9MB Binary - 84% under 50MB target
- ✅ MCP 2025-06-18 - Latest protocol version
- ✅ Production Ready - Used on real hardware
## 📄 License

Apache License 2.0 - See LICENSE for details.
## 🙏 Acknowledgments

- NVIDIA NVML - GPU Management Library
- Model Context Protocol - MCP Specification
- mcp-go - MCP Go Implementation
- Anthropic Claude - AI Assistant
- Cursor - AI-Powered IDE
## 📞 Contact

- Maintainer: @ArangoGutierrez
- Issues: GitHub Issues
- Discussions: GitHub Discussions
⭐ Star us on GitHub - it helps!