Kubernetes GPU Agent

A specialized MCP server for Kubernetes SREs providing real-time NVIDIA GPU hardware diagnostics, XID error analysis, and health monitoring through secure kubectl debug sessions.


k8s-gpu-mcp-server

Just-in-Time SRE Diagnostic Agent for NVIDIA GPU Clusters on Kubernetes



Overview

k8s-gpu-mcp-server is an ephemeral diagnostic agent that provides surgical, real-time NVIDIA GPU hardware introspection for Kubernetes clusters via the Model Context Protocol (MCP).

Unlike traditional monitoring systems, this agent is designed for AI-assisted troubleshooting by SREs debugging complex hardware failures that standard Kubernetes APIs cannot detect.

✨ Key Features

  • 🎯 On-Demand Diagnostics - Agent runs only during kubectl exec sessions
  • 🔌 Stdio Transport - JSON-RPC 2.0 over kubectl debug SPDY tunneling
  • 🔍 Deep Hardware Access - Direct NVML integration for GPU diagnostics
  • 🤖 AI-Native - Built for Claude Desktop, Cursor, and MCP-compatible hosts
  • 🔒 Secure by Default - Read-only operations with explicit operator mode
  • ✅ Production Ready - Real Tesla T4 testing, 74/74 tests passing

🚀 Quick Start

One-Line Installation

# Using npx (recommended)
npx k8s-gpu-mcp-server@latest

# Or install globally
npm install -g k8s-gpu-mcp-server

📋 Manual Configuration: Cursor / VS Code

Add to ~/.cursor/mcp.json (Cursor) or VS Code MCP config:

{
  "mcpServers": {
    "k8s-gpu-mcp": {
      "command": "npx",
      "args": ["-y", "k8s-gpu-mcp-server@latest"]
    }
  }
}

📋 Manual Configuration: Claude Desktop

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "k8s-gpu-mcp": {
      "command": "npx",
      "args": ["-y", "k8s-gpu-mcp-server@latest"]
    }
  }
}

Install from Source

# Clone and build
git clone https://github.com/ArangoGutierrez/k8s-gpu-mcp-server.git
cd k8s-gpu-mcp-server
make agent

# Test with mock GPUs (no hardware required)
cat examples/gpu_inventory.json | ./bin/agent --nvml-mode=mock

# Test with real GPU (requires NVIDIA driver)
cat examples/gpu_inventory.json | ./bin/agent --nvml-mode=real
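
The example file presumably bundles ready-made MCP JSON-RPC requests for the agent's stdin. To hand-roll a session instead, a minimal exchange looks roughly like the sketch below; whether the server answers tools/list without the full initialize handshake is implementation-dependent, so the handshake is included:

# Hand-rolled MCP session against the mock agent (sketch)
printf '%s\n%s\n%s\n' \
  '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2025-06-18","capabilities":{},"clientInfo":{"name":"manual-test","version":"0.0.0"}}}' \
  '{"jsonrpc":"2.0","method":"notifications/initialized"}' \
  '{"jsonrpc":"2.0","id":2,"method":"tools/list"}' \
  | ./bin/agent --nvml-mode=mock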

Deploy to Kubernetes

# Deploy with Helm (RuntimeClass mode - recommended)
helm install k8s-gpu-mcp-server ./deployment/helm/k8s-gpu-mcp-server \
  --namespace gpu-diagnostics --create-namespace

# Find agent pod on target node
NODE_NAME=<node-name>
POD=$(kubectl get pods -n gpu-diagnostics \
  -l app.kubernetes.io/name=k8s-gpu-mcp-server \
  --field-selector spec.nodeName=$NODE_NAME \
  -o jsonpath='{.items[0].metadata.name}')

# Start diagnostic session
kubectl exec -it -n gpu-diagnostics $POD -- /agent --mode=read-only

Note: GPU access requires runtimeClassName: nvidia, configured by the GPU Operator or nvidia-ctk. For clusters without a RuntimeClass, use the fallback: --set gpu.runtimeClass.enabled=false --set gpu.resourceRequest.enabled=true
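
To confirm which mode a deployed pod landed in, inspect its spec directly (standard kubectl field paths, nothing project-specific):

# Prints "nvidia" when RuntimeClass mode is active
kubectl get pod -n gpu-diagnostics $POD -o jsonpath='{.spec.runtimeClassName}'

# In fallback mode, look for the GPU resource request instead
kubectl get pod -n gpu-diagnostics $POD -o yaml | grep 'nvidia.com/gpu'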

Configure Claude Desktop with kubectl (Advanced)

For deployed agents, add to your Claude Desktop configuration:

{
  "mcpServers": {
    "k8s-gpu-agent": {
      "command": "kubectl",
      "args": ["exec", "-i", "deploy/k8s-gpu-mcp-server", "-n", "gpu-diagnostics", "--", "/agent"]
    }
  }
}

Then ask Claude: "What's the temperature of the GPUs?"

📖 Full Quick Start Guide →


📊 Architecture

┌──────────────┐      kubectl debug      ┌───────────────┐
│    Claude    │ ──────────────────────> │   K8s Node    │
│   Desktop    │    SPDY Stdio Tunnel    │  ┌─────────┐  │
└──────────────┘                         │  │  Agent  │  │
       ▲                                 │  │ (stdio) │  │
       │          JSON-RPC 2.0           │  └────┬────┘  │
       │          MCP Protocol           │       │       │
       └─────────────────────────────────│  ┌────▼────┐  │
                                         │  │  NVML   │  │
                                         │  │   API   │  │
                                         │  └────┬────┘  │
                                         │       │       │
                                         │  GPU 0 ... N  │
                                         └───────────────┘

Design Principles:

  • "Syringe Pattern": Ephemeral injection, zero idle footprint (see the sketch below)
  • Stdio-Only: No network listeners, firewall-friendly
  • Interface Abstraction: Testable, flexible, portable
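
Concretely, the syringe pattern is a one-shot ephemeral container: inject, diagnose, exit, and leave nothing running. A session might look like the following sketch (the image reference is illustrative, since a published image is only planned for M3):

# Inject the agent onto a node for a single diagnostic session (sketch)
kubectl debug node/<node-name> -it \
  --image=ghcr.io/arangogutierrez/k8s-gpu-mcp-server:latest \
  -- /agent --mode=read-only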

📖 Architecture Documentation →


🛠️ Available Tools

Tool                 Description                                  Status
get_gpu_inventory    Hardware inventory + telemetry               ✅ Available
analyze_xid_errors   Parse GPU XID error codes from kernel logs   ✅ Available
get_gpu_health       GPU health monitoring with scoring           ✅ Available
get_gpu_telemetry    Real-time metrics                            🚧 M2 Phase 3
inspect_topology     NVLink/PCIe topology                         🚧 M2 Phase 4
kill_gpu_process     Terminate GPU process                        🚧 M3 (Operator)
reset_gpu            GPU reset                                    🚧 M3 (Operator)
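
On the wire, each tool invocation is a standard MCP tools/call request. A sketch against the mock agent (the initialize handshake from the Quick Start may be required first; per-tool argument schemas are reported by tools/list):

# Call get_gpu_inventory through the mock agent (sketch)
echo '{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"get_gpu_inventory","arguments":{}}}' \
  | ./bin/agent --nvml-mode=mock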

📖 MCP Usage Guide →


📈 Project Status

Current Milestone: M2: Hardware Introspection

Due: January 17, 2026
Progress: Phase 1 Complete (Real NVML ✅)

Completed Milestones

  • M1: Foundation & API - Completed Jan 3, 2026
    • Go module scaffolding
    • MCP stdio server
    • Mock NVML implementation
    • Comprehensive CI/CD

Recent Updates

  • Jan 4, 2026: GPU health monitoring tool (get_gpu_health) merged
  • Jan 3, 2026: XID error analysis tool (analyze_xid_errors) merged
  • Jan 3, 2026: Real NVML integration complete, tested on Tesla T4
  • Jan 3, 2026: 74/74 tests passing, 5/5 integration tests on real GPU

📊 View All Milestones →


🧪 Testing

Unit Tests (No GPU Required)

make test                   # Run all unit tests (74/74 passing)
make coverage               # Generate coverage report
make coverage-html          # View coverage in browser

Integration Tests (Requires GPU)

make test-integration       # Run on GPU hardware
# Or manually:
go test -tags=integration -v ./pkg/nvml/

Latest Test Results on Tesla T4:

✓ TestRealNVML_Integration
  - GPU: Tesla T4 (15GB)
  - Temperature: 29°C
  - Power: 13.9W
  - Utilization: 0% (idle)

✓ 5/5 integration tests passing
✓ 74/74 total tests passing

🏗️ Build

# Build for local platform
make agent

# Build for Linux (with real NVML)
CGO_ENABLED=1 GOOS=linux GOARCH=amd64 make agent

# Build container image
make image

# Multi-arch release builds
make dist

Binary Sizes:

  • Mock mode: 4.3MB (CGO disabled)
  • Real mode: 7.9MB (CGO enabled)
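
Assuming the Makefile honors the standard Go environment variables (worth verifying against the Makefile itself), the two variants come from toggling CGO:

# Mock-only binary with no NVML linkage (presumably the 4.3MB figure)
CGO_ENABLED=0 make agent

# Real-NVML binary (matches the cross-compile example above)
CGO_ENABLED=1 make agent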

📦 Installation

Using npm (Recommended)

# Run directly with npx
npx k8s-gpu-mcp-server@latest

# Or install globally
npm install -g k8s-gpu-mcp-server

From Source

git clone https://github.com/ArangoGutierrez/k8s-gpu-mcp-server.git
cd k8s-gpu-mcp-server
make agent
sudo mv bin/agent /usr/local/bin/k8s-gpu-mcp-server

Using Go

go install github.com/ArangoGutierrez/k8s-gpu-mcp-server/cmd/agent@latest

Container Image (Coming in M3)

docker pull ghcr.io/arangogutierrez/k8s-gpu-mcp-server:latest

🤝 Contributing

We welcome contributions! Please see our Development Guide for details.

Quick Contribution Guide

  1. Check open issues
  2. Fork and create feature branch: git checkout -b feat/my-feature
  3. Make changes, add tests
  4. Run checks: make all
  5. Commit with DCO: git commit -s -S -m "feat(scope): description"
  6. Open PR with labels and milestone

📖 Full Development Guide →


🎯 Use Cases

1. Debugging Stuck Training Jobs

SRE: "Why is the training job on node-5 stuck?"
Claude → k8s-gpu-mcp-server → Detects XID 48 (ECC Error)
Claude: "Node-5 has uncorrectable memory errors. Drain immediately."

2. Thermal Management

SRE: "Are any GPUs thermal throttling?"
Claude → k8s-gpu-mcp-server → Checks temps and throttle status
Claude: "GPU 3 is at 86°C and thermal throttling. Check cooling."

3. Topology Validation

SRE: "Is NVLink properly configured for multi-GPU training?"
Claude → k8s-gpu-mcp-server → Inspects NVLink topology
Claude: "All 8 GPUs connected via NVLink, 600GB/s bandwidth."

4. Zombie Process Hunting

SRE: "GPU memory is full but no pods are running"
Claude → k8s-gpu-mcp-server → Lists GPU processes
Claude: "Found zombie process PID 12345 using 8GB. Kill it?"

🏆 Achievements

  • Go 1.25 - Latest Go version
  • Real NVML - Tested on Tesla T4
  • 74/74 Tests - 100% passing with race detector
  • Zero Lint Issues - Clean codebase
  • 7.9MB Binary - 84% under the 50MB target
  • MCP 2025-06-18 - Latest protocol version
  • Production Ready - Used on real hardware

📄 License

Apache License 2.0 - See LICENSE for details.


📞 Contact

Maintainer: @ArangoGutierrez
Issues: GitHub Issues
Discussions: GitHub Discussions


⭐ Star us on GitHub — it helps!

Report Bug · Request Feature · View Roadmap
