
voiceblitz-mcp

Ultra-fast desktop voice assistant. Pocket TTS (~200ms) + Parakeet ASR + Smart Turn VAD—knows when you're done talking, not just silent. Modular architecture: swap any component. MCP server for Claude Code. Works with LM Studio, OpenRouter, OpenAI. Custom skills + voice cloning. 100% local.

Updated
Jan 25, 2026

LocalVoiceMode

Local voice interface with Character Skills: a self-contained voice chat system.

Uses Parakeet TDT 0.6B (NVIDIA) for fast GPU speech recognition and Pocket TTS (Kyutai) for natural text-to-speech. Auto-detects LM Studio, OpenRouter, or OpenAI as the LLM backend.

Features

  • Parakeet TDT ASR - NVIDIA's fast speech recognition (GPU accelerated via ONNX)
  • Pocket TTS - Kyutai's natural-sounding text-to-speech with voice cloning
  • Smart Turn Detection - Knows when you're done speaking, not just detecting silence
  • Auto-Provider Detection - Automatically finds LM Studio, or falls back to OpenRouter/OpenAI
  • Modern Rich UI - Beautiful terminal interface with audio visualization
  • Character Skills - Load different personalities with custom voices
  • MCP Integration - Works with Claude Code and other MCP-enabled tools

Quick Start

1. Clone and Setup

git clone https://github.com/your-username/localvoicemode.git
cd localvoicemode
setup.bat

This creates a virtual environment and installs all dependencies.

2. HuggingFace Login (Required)

Pocket TTS requires accepting the model license:

.venv\Scripts\huggingface-cli.exe login

Then accept the license at: https://huggingface.co/kyutai/pocket-tts

3. Configure LLM Provider

Option A: LM Studio (Recommended for local)

  1. Open LM Studio
  2. Load your preferred model
  3. Start the local server (default: http://localhost:1234)

Option B: OpenRouter

set OPENROUTER_API_KEY=your-key-here

Get your key at: https://openrouter.ai/keys

Option C: OpenAI

set OPENAI_API_KEY=your-key-here

4. Run Voice Chat

REM Default assistant
VoiceChat.bat

REM With Hermione character
VoiceChat.bat hermione

REM Push-to-talk mode
VoiceChat.bat hermione ptt

Provider Detection

LocalVoiceMode automatically detects available providers in this order:

  1. LM Studio - Scans ports 1234, 1235, 1236, 8080, 5000
  2. OpenRouter - Uses OPENROUTER_API_KEY environment variable
  3. OpenAI - Uses OPENAI_API_KEY environment variable

Force a specific provider with VOICE_PROVIDER=openrouter (or lm_studio, openai).
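The detection order above can be sketched in a few lines (a sketch only, not the project's actual code; the `port_open` callback stands in for whatever probe the real scanner uses):

```python
from typing import Callable, Optional

# Ports scanned for a running LM Studio server, per the list above
LM_STUDIO_PORTS = [1234, 1235, 1236, 8080, 5000]

def detect_provider(env: dict, port_open: Callable[[int], bool]) -> Optional[str]:
    """Pick a provider: explicit override first, then LM Studio ports, then API keys."""
    forced = env.get("VOICE_PROVIDER")
    if forced:
        return forced
    if any(port_open(p) for p in LM_STUDIO_PORTS):
        return "lm_studio"
    if env.get("OPENROUTER_API_KEY"):
        return "openrouter"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return None
```

Injecting the environment and the port probe keeps the logic testable without a live server.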

Directory Structure

localvoicemode/
├── voice_client.py        # Main voice client entry point
├── mcp_server.py          # MCP server for AI assistant integration
├── requirements.txt       # Python dependencies
├── setup.bat              # Setup script (run first!)
├── VoiceChat.bat          # Launch script
├── start_voicemode.bat    # MCP server launcher
│
├── src/localvoicemode/    # Core package
│   ├── audio/             # Audio recording
│   ├── speech/            # ASR, TTS, VAD, filters
│   ├── llm/               # Provider management
│   ├── skills/            # Skill loading
│   └── state/             # State machines, config
│
├── skills/                # Character skills
│   ├── assistant-default/ # Default assistant
│   └── hermione-companion/
│       ├── SKILL.md       # Character definition
│       ├── references/    # Lore files
│       └── scripts/       # Helper scripts
│
└── voice_references/      # Custom voice files (.wav)

Skills System

Skills define character personalities, system prompts, and optional knowledge.

List Available Skills

.venv\Scripts\python.exe voice_client.py --list-skills

Create a New Skill

  1. Create directory: skills/my-skill/
  2. Create SKILL.md:
---
id: my-skill
name: My Character
display_name: "My Character"
description: Brief description
metadata:
  greeting: "Hello! How can I help?"
---

# My Character

## System Prompt

You are My Character. [Full instructions here...]
  3. Add optional files:
    • reference.wav - Voice clone source (10s of clear speech)
    • avatar.png - Character image
    • references/ - Knowledge markdown files
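A SKILL.md like the one above splits into frontmatter and body with a few lines of stdlib Python (a sketch; the real loader in src/localvoicemode/skills/ may differ, and nested keys like `metadata:` would need a proper YAML parser):

```python
def parse_skill(text: str) -> tuple[dict, str]:
    """Split a SKILL.md into (frontmatter fields, markdown body).

    Handles only the flat `key: value` lines shown above.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text                     # no frontmatter block
    fields, i = {}, 1
    while i < len(lines) and lines[i].strip() != "---":
        line = lines[i]
        if ":" in line and not line.startswith(" "):
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip().strip('"')
        i += 1
    body = "\n".join(lines[i + 1:])         # everything after the closing ---
    return fields, body
```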

Voice Cloning

Pocket TTS supports voice cloning from reference audio.

Requirements:

  • WAV format (16-bit PCM)
  • ~10 seconds of clean speech
  • Clear recording, minimal background noise

Place the file at:

  • skills/my-skill/reference.wav (per-skill), or
  • voice_references/my-skill.wav (global)
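The requirements above can be checked with the stdlib `wave` module before dropping a file in place (a sketch; `check_reference_wav` is a hypothetical helper, not part of the project):

```python
import wave

def check_reference_wav(path: str) -> list[str]:
    """Return a list of problems with a voice-clone reference file."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:          # 16-bit PCM = 2 bytes per sample
            problems.append("not 16-bit PCM")
        duration = wav.getnframes() / wav.getframerate()
        if not 5.0 <= duration <= 20.0:      # aim for roughly 10 seconds
            problems.append(f"duration {duration:.1f}s (aim for ~10s)")
    return problems
```

The 5-20 second window is an assumed tolerance around the ~10 s guideline above.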

Voice Modes

VAD Mode (default)

Voice Activity Detection with Smart Turn - automatically detects when you're done speaking.

VoiceChat.bat hermione
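Conceptually, Smart Turn gates end-of-turn on a model confidence score rather than silence alone (a hypothetical sketch; the real pipeline lives in src/localvoicemode/speech/, and the minimum-silence value here is an assumption):

```python
def turn_complete(turn_prob: float, silence_ms: int,
                  threshold: float = 0.5, min_silence_ms: int = 200) -> bool:
    """End the turn only when the turn-completion model is confident
    AND a short silence has elapsed; silence alone is not enough."""
    return turn_prob >= threshold and silence_ms >= min_silence_ms
```

The default threshold matches VOICE_SMART_TURN_THRESHOLD below; a thoughtful pause mid-sentence scores low and keeps the microphone open.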

PTT Mode

Push-to-Talk - hold Space to record, release to send.

VoiceChat.bat hermione ptt

Configuration

Environment Variables

| Variable | Default | Description |
|---|---|---|
| VOICE_API_URL | http://localhost:1234/v1 | OpenAI-compatible API URL |
| VOICE_API_KEY | (none) | API key for the provider |
| VOICE_MODEL | (auto) | Model name to use |
| VOICE_PROVIDER | (auto) | Force provider: lm_studio, openrouter, openai |
| OPENROUTER_API_KEY | (none) | OpenRouter API key |
| OPENAI_API_KEY | (none) | OpenAI API key |
| VOICE_TTS_VOICE | alba | Default TTS voice |
| VOICE_DEVICE | cuda | ASR device: cuda (GPU) or cpu |
| VOICE_SMART_TURN_THRESHOLD | 0.5 | Turn completion threshold (0.0-1.0) |
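Resolving these variables amounts to plain environment lookups with the documented defaults (a sketch; the dataclass and field names are illustrative, not the project's actual config object):

```python
import os
from dataclasses import dataclass

@dataclass
class VoiceConfig:
    api_url: str
    tts_voice: str
    device: str
    smart_turn_threshold: float

def load_config(env=os.environ) -> VoiceConfig:
    """Apply the documented defaults for any unset variable."""
    return VoiceConfig(
        api_url=env.get("VOICE_API_URL", "http://localhost:1234/v1"),
        tts_voice=env.get("VOICE_TTS_VOICE", "alba"),
        device=env.get("VOICE_DEVICE", "cuda"),
        smart_turn_threshold=float(env.get("VOICE_SMART_TURN_THRESHOLD", "0.5")),
    )
```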

Command Line Options

python voice_client.py [options]

Options:
  --skill, -s SKILL      Load a character skill
  --list-skills, -l      List available skills
  --list-providers       List available LLM providers
  --provider, -p PROV    Force provider: lm_studio, openrouter, openai
  --mode, -m MODE        Input mode: vad, ptt, or type
  --device DEVICE        ASR device: cuda or cpu
  --api-url URL          OpenAI-compatible API URL
  --api-key KEY          API key for the provider
  --model MODEL          Model name to use
  --headless             Run without UI (for MCP integration)
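Wiring flags like these with argparse might look as follows (a sketch mirroring the help text above, not the project's actual parser):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="voice_client.py")
    p.add_argument("--skill", "-s", help="Load a character skill")
    p.add_argument("--list-skills", "-l", action="store_true",
                   help="List available skills")
    p.add_argument("--provider", "-p",
                   choices=["lm_studio", "openrouter", "openai"])
    p.add_argument("--mode", "-m", choices=["vad", "ptt", "type"],
                   default="vad")
    p.add_argument("--device", choices=["cuda", "cpu"], default="cuda")
    p.add_argument("--headless", action="store_true",
                   help="Run without UI (for MCP integration)")
    return p
```

The `vad` and `cuda` defaults follow the VAD-mode and VOICE_DEVICE defaults documented elsewhere on this page.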

MCP Integration

LocalVoiceMode includes an MCP server for integration with Claude Code and other MCP-enabled tools.

Start MCP Server

start_voicemode.bat

Available Tools

  • speak(text) - Speak text aloud (TTS)
  • listen() - Listen for speech (STT)
  • converse(text) - Speak and listen for response
  • start_voice(skill) - Start voice chat with a character
  • stop_voice() - Stop voice chat
  • voice_status() - Check if voice mode is running
  • list_voices() - List available characters
  • provider_status() - Show available providers
  • set_speech_mode(mode) - Set verbosity: roleplay, coder, minimal, silent
  • get_speech_mode() - Get current speech mode
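From a client's point of view, each tool is just a name plus keyword arguments. A toy dispatch table illustrates the shape of that surface (the handlers here are hypothetical stubs, not the server's implementation):

```python
from typing import Any, Callable

# Stub handlers standing in for the real TTS/STT machinery
TOOLS: dict[str, Callable[..., Any]] = {
    "speak": lambda text: f"spoke: {text}",
    "listen": lambda: "transcribed speech",
    "voice_status": lambda: {"running": False},
}

def call_tool(name: str, **kwargs: Any) -> Any:
    """Dispatch a tool call by name, as an MCP client effectively does."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)
```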

Slash Commands

These slash commands are available in Claude Code and compatible AI assistants:

| Command | Description |
|---|---|
| /speak <text> | TTS only - speak text aloud |
| /listen | STT only - transcribe speech to text |
| /tts-only | Mode: Claude speaks, you type |
| /stt-only | Mode: You speak, Claude responds in text |
| /voice-roleplay | Full expressive speech output |
| /voice-coder | Summaries & completions only |
| /voice | Speak one message via voice |
| /voice-on | Start continuous voice mode |
| /voice-off | Stop voice mode |
| /voice-typing | You type, Claude speaks (hold RIGHT SHIFT to speak) |

Speech Modes

Control how much Claude speaks:

| Mode | Description |
|---|---|
| roleplay | Full expressive output - speaks everything naturally (default) |
| coder | Summaries only - task completions, errors, questions |
| minimal | Very terse - only critical announcements |
| silent | No speech - text only |

Switch modes with /voice-roleplay, /voice-coder, or the set_speech_mode() tool.
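A minimal sketch of mode-based gating (the message categories are assumptions inferred from the descriptions above; the real filter may use different rules):

```python
def should_speak(mode: str, category: str) -> bool:
    """Decide whether a message is voiced, given the current speech mode.

    category is one of: "chat", "completion", "error", "question", "critical".
    """
    allowed = {
        "roleplay": {"chat", "completion", "error", "question", "critical"},
        "coder": {"completion", "error", "question"},
        "minimal": {"critical"},
        "silent": set(),
    }
    return category in allowed.get(mode, set())
```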

Voice Commands While Running

  • Say "stop" or "goodbye" to end
  • Say "change voice" to switch characters
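Recognizing these in-session commands can be as simple as a phrase check on each final transcript (a hypothetical sketch; the command names returned here are illustrative):

```python
def detect_voice_command(transcript: str) -> "str | None":
    """Map spoken control phrases to commands; None means normal speech."""
    text = transcript.lower().strip(" .!?")
    if text in ("stop", "goodbye"):
        return "exit"
    if "change voice" in text:
        return "change_voice"
    return None
```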

GPU Support

Parakeet TDT uses ONNX Runtime with GPU acceleration:

  1. TensorRT (best performance) - Auto-detected if installed
  2. CUDA (good performance) - Requires CUDA/cuDNN
  3. CPU (fallback) - Always available
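The fallback order above amounts to a preference list intersected with what onnxruntime reports (a sketch; the provider strings are the real ONNX Runtime identifiers):

```python
# Best-first order: TensorRT, then CUDA, then CPU
PREFERRED = ["TensorrtExecutionProvider", "CUDAExecutionProvider",
             "CPUExecutionProvider"]

def pick_providers(available: list[str]) -> list[str]:
    """Keep the preferred order, dropping anything not installed."""
    return [p for p in PREFERRED if p in available]
```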

Check GPU status:

.venv\Scripts\python.exe -c "import onnxruntime as ort; print(ort.get_available_providers())"

Troubleshooting

No audio detected

  • Check microphone permissions
  • Verify default audio device: python -c "import sounddevice; print(sounddevice.query_devices())"

Pocket TTS not working

  • Log in with huggingface-cli (see step 2 of Quick Start)
  • Make sure you accepted the model license at https://huggingface.co/kyutai/pocket-tts

LM Studio connection failed

  • Verify LM Studio server is running
  • Check URL: default is http://localhost:1234
  • Ensure a model is loaded

OpenRouter/OpenAI not working

  • Verify API key is set in .env or environment
  • Check python voice_client.py --list-providers to see detected providers

GPU/CUDA not working

  • Ensure NVIDIA drivers are installed
  • Install CUDA Toolkit 12.x
  • Reinstall: pip uninstall onnxruntime onnxruntime-gpu && pip install onnxruntime-gpu[cuda,cudnn]

Credits
