
English | 简体中文 | 繁體中文 | 日本語

StableAvatar


🎬 Infinite-Length Audio-Driven Avatar Video Generation - Generate high-quality talking head videos from a single image and audio input.

Based on the StableAvatar research paper, this project provides a production-ready Docker deployment with a Web UI, a REST API, and MCP server support.

✨ Features

  • 🖼️ Single Image Input - Generate videos from just one reference image
  • 🎵 Audio-Driven - Lip-sync animation driven by any audio input
  • ♾️ Infinite Length - Generate videos of any duration without quality degradation
  • 🎯 Identity Preserving - Maintains facial identity throughout the video
  • 🌐 Web UI - User-friendly interface with real-time GPU monitoring
  • 🔌 REST API - Full-featured API with Swagger documentation
  • 🤖 MCP Server - Claude Desktop integration support
  • 🐳 Docker Ready - One-command deployment with all dependencies

🚀 Quick Start

Docker (Recommended)

# Pull and run
docker run -d \
  --name stableavatar \
  --gpus all \
  -p 8400:8400 \
  -v ./checkpoints:/app/checkpoints:ro \
  -v ./outputs:/app/outputs \
  neosun/stableavatar:latest

# Access Web UI
open http://localhost:8400
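If the Web UI does not load right away, confirm the service finished starting. This is a minimal sanity check, assuming the web server answers plain HTTP on port 8400; the exact startup log messages are not documented here.

# Follow container logs (the first start can take a while as models load)
docker logs -f stableavatar

# Confirm the HTTP endpoint responds (any 2xx/3xx code means it is up)
curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:8400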

Docker Compose

version: '3.8'
services:
  stableavatar:
    image: neosun/stableavatar:latest
    container_name: stableavatar
    ports:
      - "8400:8400"
    volumes:
      - ./checkpoints:/app/checkpoints:ro
      - ./outputs:/app/outputs
    environment:
      - GPU_MEMORY_MODE=model_cpu_offload
      - GPU_IDLE_TIMEOUT=300
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

# Start the service
docker-compose up -d

📦 Installation

Prerequisites

  • NVIDIA GPU with 18GB+ VRAM (RTX 4090, L40S, A100, etc.)
  • Docker with NVIDIA Container Toolkit
  • ~30GB disk space for model checkpoints
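Before pulling ~30GB of checkpoints, it is worth confirming that Docker can see the GPU at all. A quick check using a stock CUDA image (the tag below is only an example; any recent nvidia/cuda base image works):

# Verify the NVIDIA Container Toolkit is wired up correctly
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If this prints the usual nvidia-smi table, the StableAvatar container will be able to use the GPU as well.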

Download Model Checkpoints

# Install huggingface-cli
pip install "huggingface_hub[cli]"

# Download all checkpoints
mkdir -p checkpoints
huggingface-cli download FrancisRing/StableAvatar --local-dir ./checkpoints

Model Files Structure

checkpoints/
├── Kim_Vocal_2.onnx              # Vocal separation model
├── wav2vec2-base-960h/           # Audio encoder
├── Wan2.1-Fun-V1.1-1.3B-InP/     # Base diffusion model
└── StableAvatar-1.3B/            # StableAvatar weights
    ├── transformer3d-square.pt   # For 512x512 output
    └── transformer3d-rec-vec.pt  # For 480x832/832x480 output
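If you only plan to render one output shape, you may be able to skip the other transformer checkpoint. This is a sketch, assuming huggingface-cli's --exclude filter and the file layout shown above; adjust the pattern if the repository layout differs:

# Example: keep only the 512x512 weights, skip the 480x832/832x480 checkpoint
huggingface-cli download FrancisRing/StableAvatar \
  --local-dir ./checkpoints \
  --exclude "StableAvatar-1.3B/transformer3d-rec-vec.pt"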

⚙️ Configuration

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| PORT | 8400 | Service port |
| GPU_MEMORY_MODE | model_cpu_offload | Memory optimization mode |
| GPU_IDLE_TIMEOUT | 300 | Auto-release VRAM after idle (seconds) |
| NVIDIA_VISIBLE_DEVICES | all | GPU device selection |
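When running without Compose, the same variables can be passed with -e flags on docker run. For example, to trade speed for a smaller VRAM footprint (the values here are illustrative; see the mode table below):

docker run -d \
  --name stableavatar \
  --gpus all \
  -p 8400:8400 \
  -e GPU_MEMORY_MODE=sequential_cpu_offload \
  -e GPU_IDLE_TIMEOUT=600 \
  -v ./checkpoints:/app/checkpoints:ro \
  -v ./outputs:/app/outputs \
  neosun/stableavatar:latest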

GPU Memory Modes

| Mode | VRAM Usage | Speed | Description |
|------|------------|-------|-------------|
| model_full_load | ~18GB | Fastest | Load all models to GPU |
| model_cpu_offload | ~10GB | Fast | Offload to CPU when idle |
| sequential_cpu_offload | ~3GB | Slow | Minimal VRAM usage |
| model_cpu_offload_and_qfloat8 | ~8GB | Medium | FP8 quantization |
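If you are unsure which mode you need, watch VRAM usage on the host while a test job runs and pick the smallest mode that fits comfortably:

# Sample GPU memory usage every 2 seconds during a generation run
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2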

📖 Usage

Web UI

Access http://localhost:8400 for the web interface:

  1. Upload a reference image (face photo)
  2. Upload driving audio (speech/singing)
  3. Set parameters (resolution, steps, etc.)
  4. Click "Generate Video"

REST API

# Generate video
curl -X POST "http://localhost:8400/api/generate" \
  -F "image=@reference.jpg" \
  -F "audio=@speech.wav" \
  -F "prompt=A person is talking" \
  -F "width=512" \
  -F "height=512" \
  -F "num_inference_steps=30"

# Check status
curl "http://localhost:8400/api/status/{task_id}"

# Download result
curl -O "http://localhost:8400/api/result/{task_id}"
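Generation is asynchronous, so scripts usually poll the status endpoint and then fetch the result. The sketch below assumes the generate response is JSON with a task_id field and that a finished task reports a terminal state such as completed or failed; those names are assumptions, so check the Swagger docs for the actual schema.

# Submit a job and extract the task id (the field name "task_id" is an assumption)
RESPONSE=$(curl -s -X POST "http://localhost:8400/api/generate" \
  -F "image=@reference.jpg" \
  -F "audio=@speech.wav" \
  -F "width=512" -F "height=512")
TASK_ID=$(echo "$RESPONSE" | python3 -c "import sys, json; print(json.load(sys.stdin)['task_id'])")

# Poll until the task reaches a terminal state, then download the video
while true; do
  STATUS=$(curl -s "http://localhost:8400/api/status/${TASK_ID}")
  echo "$STATUS"
  echo "$STATUS" | grep -Eq '"(completed|failed)"' && break
  sleep 15
done
curl -o output.mp4 "http://localhost:8400/api/result/${TASK_ID}"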

API Documentation

Swagger UI available at: http://localhost:8400/apidocs/

🔧 API Reference

POST /api/generate

Generate avatar video from image and audio.

Parameters:

| Name | Type | Default | Description |
|------|------|---------|-------------|
| image | file | required | Reference face image |
| audio | file | required | Driving audio file |
| prompt | string | "" | Text prompt for generation |
| width | int | 512 | Output width (512/480/832) |
| height | int | 512 | Output height (512/832/480) |
| num_inference_steps | int | 50 | Denoising steps (30-50) |
| seed | int | -1 | Random seed (-1 for random) |
| text_guide_scale | float | 3.0 | Text guidance scale |
| audio_guide_scale | float | 5.0 | Audio guidance scale |
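For a reproducible portrait-orientation render, the documented parameters can be combined like this (only the input file names are placeholders):

curl -X POST "http://localhost:8400/api/generate" \
  -F "image=@reference.jpg" \
  -F "audio=@speech.wav" \
  -F "prompt=A person is talking" \
  -F "width=480" \
  -F "height=832" \
  -F "num_inference_steps=50" \
  -F "seed=42"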

GET /api/status/{task_id}

Get task status and progress.

GET /api/result/{task_id}

Download generated video.

🛠️ Tech Stack

  • Model: Wan2.1-1.3B DiT + StableAvatar Audio Adapter
  • Framework: PyTorch 2.6, Diffusers
  • Audio: Wav2Vec2, Librosa
  • Backend: Flask, Gunicorn
  • Frontend: Vanilla JS, TailwindCSS

📊 Performance

| Resolution | Duration | VRAM | Time (RTX 4090) |
|------------|----------|------|-----------------|
| 512x512 | 5s | ~18GB | ~3 min |
| 512x512 | 20s | ~20GB | ~10 min |
| 480x832 | 5s | ~20GB | ~4 min |

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 Changelog

v1.0.0 (2026-01-04)

  • Initial release with Docker support
  • Web UI with multi-language support
  • REST API with Swagger documentation
  • MCP server integration

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

