# StableAvatar

🎬 Infinite-Length Audio-Driven Avatar Video Generation - Generate high-quality talking-head videos from a single image and an audio input.

Based on the StableAvatar research paper, this project provides a production-ready Docker deployment with a Web UI, a REST API, and MCP server support.
## ✨ Features
- 🖼️ Single Image Input - Generate videos from just one reference image
- 🎵 Audio-Driven - Lip-sync animation driven by any audio input
- ♾️ Infinite Length - Generate videos of any duration without quality degradation
- 🎯 Identity Preserving - Maintains facial identity throughout the video
- 🌐 Web UI - User-friendly interface with real-time GPU monitoring
- 🔌 REST API - Full-featured API with Swagger documentation
- 🤖 MCP Server - Claude Desktop integration support
- 🐳 Docker Ready - One-command deployment with all dependencies
## 🚀 Quick Start

### Docker (Recommended)
```bash
# Pull and run
docker run -d \
  --name stableavatar \
  --gpus all \
  -p 8400:8400 \
  -v ./checkpoints:/app/checkpoints:ro \
  -v ./outputs:/app/outputs \
  neosun/stableavatar:latest

# Access the Web UI
open http://localhost:8400
```
### Docker Compose

```yaml
version: '3.8'

services:
  stableavatar:
    image: neosun/stableavatar:latest
    container_name: stableavatar
    ports:
      - "8400:8400"
    volumes:
      - ./checkpoints:/app/checkpoints:ro
      - ./outputs:/app/outputs
    environment:
      - GPU_MEMORY_MODE=model_cpu_offload
      - GPU_IDLE_TIMEOUT=300
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
```

```bash
docker-compose up -d
```
## 📦 Installation

### Prerequisites
- NVIDIA GPU with 18GB+ VRAM (RTX 4090, L40S, A100, etc.)
- Docker with NVIDIA Container Toolkit
- ~30GB disk space for model checkpoints
### Download Model Checkpoints

```bash
# Install huggingface-cli
pip install "huggingface_hub[cli]"

# Download all checkpoints
mkdir -p checkpoints
huggingface-cli download FrancisRing/StableAvatar --local-dir ./checkpoints
```
### Model Files Structure

```
checkpoints/
├── Kim_Vocal_2.onnx              # Vocal separation model
├── wav2vec2-base-960h/           # Audio encoder
├── Wan2.1-Fun-V1.1-1.3B-InP/     # Base diffusion model
└── StableAvatar-1.3B/            # StableAvatar weights
    ├── transformer3d-square.pt   # For 512x512 output
    └── transformer3d-rec-vec.pt  # For 480x832 / 832x480 output
```
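
After the download completes, a quick sanity check that the files above are in place can save a failed container start. Below is a minimal sketch, assuming the `./checkpoints` layout shown above:

```bash
# Verify the expected checkpoint files and directories exist
for p in \
  "checkpoints/Kim_Vocal_2.onnx" \
  "checkpoints/wav2vec2-base-960h" \
  "checkpoints/Wan2.1-Fun-V1.1-1.3B-InP" \
  "checkpoints/StableAvatar-1.3B/transformer3d-square.pt" \
  "checkpoints/StableAvatar-1.3B/transformer3d-rec-vec.pt"; do
  if [ -e "$p" ]; then
    echo "OK      $p"
  else
    echo "MISSING $p"
  fi
done
```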
## ⚙️ Configuration

### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8400` | Service port |
| `GPU_MEMORY_MODE` | `model_cpu_offload` | Memory optimization mode |
| `GPU_IDLE_TIMEOUT` | `300` | Auto-release VRAM after idle (seconds) |
| `NVIDIA_VISIBLE_DEVICES` | `all` | GPU device selection |
### GPU Memory Modes

| Mode | VRAM Usage | Speed | Description |
|---|---|---|---|
| `model_full_load` | ~18GB | Fastest | Load all models to GPU |
| `model_cpu_offload` | ~10GB | Fast | Offload to CPU when idle |
| `sequential_cpu_offload` | ~3GB | Slow | Minimal VRAM usage |
| `model_cpu_offload_and_qfloat8` | ~8GB | Medium | FP8 quantization |
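
To try one of these modes without editing the compose file, the environment variables above can be passed to `docker run` with `-e`. A minimal sketch (the mode and timeout values here are only illustrative):

```bash
# Run with a lower-VRAM memory mode and a shorter idle timeout
docker run -d \
  --name stableavatar \
  --gpus all \
  -p 8400:8400 \
  -e GPU_MEMORY_MODE=sequential_cpu_offload \
  -e GPU_IDLE_TIMEOUT=120 \
  -v ./checkpoints:/app/checkpoints:ro \
  -v ./outputs:/app/outputs \
  neosun/stableavatar:latest
```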
## 📖 Usage

### Web UI

Access http://localhost:8400 for the web interface:

1. Upload a reference image (face photo)
2. Upload driving audio (speech/singing)
3. Set parameters (resolution, steps, etc.)
4. Click "Generate Video"
### REST API

```bash
# Generate video
curl -X POST "http://localhost:8400/api/generate" \
  -F "image=@reference.jpg" \
  -F "audio=@speech.wav" \
  -F "prompt=A person is talking" \
  -F "width=512" \
  -F "height=512" \
  -F "num_inference_steps=30"

# Check status
curl "http://localhost:8400/api/status/{task_id}"

# Download result
curl -O "http://localhost:8400/api/result/{task_id}"
```
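
For scripting, the three calls above can be chained into a submit/poll/download loop. The sketch below assumes the `/api/generate` response is JSON containing a `task_id` field and that `/api/status/{task_id}` reports a `status` value such as `completed` or `failed`; check the Swagger documentation below for the exact response schema. It also requires `jq`.

```bash
#!/usr/bin/env bash
set -euo pipefail

BASE="http://localhost:8400"

# Submit the generation job (the task_id field in the JSON response is assumed)
TASK_ID=$(curl -s -X POST "$BASE/api/generate" \
  -F "image=@reference.jpg" \
  -F "audio=@speech.wav" \
  -F "prompt=A person is talking" \
  -F "width=512" -F "height=512" \
  -F "num_inference_steps=30" | jq -r '.task_id')
echo "Task: $TASK_ID"

# Poll until the task finishes (status values are assumed)
while true; do
  STATUS=$(curl -s "$BASE/api/status/$TASK_ID" | jq -r '.status')
  echo "Status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  [ "$STATUS" = "failed" ] && exit 1
  sleep 10
done

# Download the generated video
curl -s -o result.mp4 "$BASE/api/result/$TASK_ID"
echo "Saved result.mp4"
```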
### API Documentation

Swagger UI is available at http://localhost:8400/apidocs/
## 🔧 API Reference

### `POST /api/generate`

Generates an avatar video from an image and an audio file.

Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| `image` | file | required | Reference face image |
| `audio` | file | required | Driving audio file |
| `prompt` | string | `""` | Text prompt for generation |
| `width` | int | `512` | Output width (512/480/832) |
| `height` | int | `512` | Output height (512/832/480) |
| `num_inference_steps` | int | `50` | Denoising steps (30-50) |
| `seed` | int | `-1` | Random seed (-1 for random) |
| `text_guide_scale` | float | `3.0` | Text guidance scale |
| `audio_guide_scale` | float | `5.0` | Audio guidance scale |
### `GET /api/status/{task_id}`

Get task status and progress.

### `GET /api/result/{task_id}`

Download the generated video.
## 🛠️ Tech Stack
- Model: Wan2.1-1.3B DiT + StableAvatar Audio Adapter
- Framework: PyTorch 2.6, Diffusers
- Audio: Wav2Vec2, Librosa
- Backend: Flask, Gunicorn
- Frontend: Vanilla JS, TailwindCSS
## 📊 Performance
| Resolution | Duration | VRAM | Time (RTX 4090) |
|---|---|---|---|
| 512x512 | 5s | ~18GB | ~3 min |
| 512x512 | 20s | ~20GB | ~10 min |
| 480x832 | 5s | ~20GB | ~4 min |
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## 📝 Changelog

### v1.0.0 (2026-01-04)
- Initial release with Docker support
- Web UI with multi-language support
- REST API with Swagger documentation
- MCP server integration
## 📄 License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
## 🙏 Acknowledgments
- StableAvatar - Original research and model
- Wan2.1 - Base video diffusion model