
English | 简体中文 | 繁體中文 | 日本語

StableAvatar


🎬 Infinite-Length Audio-Driven Avatar Video Generation - Generate high-quality talking head videos from a single image and audio input.

Based on the StableAvatar research paper, this project provides a production-ready Docker deployment with a Web UI, a REST API, and MCP server support.

✨ Features

  • 🖼️ Single Image Input - Generate videos from just one reference image
  • 🎵 Audio-Driven - Lip-sync animation driven by any audio input
  • ♾️ Infinite Length - Generate videos of any duration without quality degradation
  • 🎯 Identity Preserving - Maintains facial identity throughout the video
  • 🌐 Web UI - User-friendly interface with real-time GPU monitoring
  • 🔌 REST API - Full-featured API with Swagger documentation
  • 🤖 MCP Server - Claude Desktop integration support
  • 🐳 Docker Ready - One-command deployment with all dependencies

🚀 Quick Start

Docker (Recommended)

# Pull and run
docker run -d \
  --name stableavatar \
  --gpus all \
  -p 8400:8400 \
  -v ./checkpoints:/app/checkpoints:ro \
  -v ./outputs:/app/outputs \
  neosun/stableavatar:latest

# Access Web UI
open http://localhost:8400
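If the Web UI does not load right away, confirm the service finished starting. This is a minimal sanity check, assuming the web server answers plain HTTP on port 8400; the exact startup log messages are not documented here.

# Follow container logs (the first start can take a while as models load)
docker logs -f stableavatar

# Confirm the HTTP endpoint responds (any 2xx/3xx code means it is up)
curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:8400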

Docker Compose

version: '3.8'
services:
  stableavatar:
    image: neosun/stableavatar:latest
    container_name: stableavatar
    ports:
      - "8400:8400"
    volumes:
      - ./checkpoints:/app/checkpoints:ro
      - ./outputs:/app/outputs
    environment:
      - GPU_MEMORY_MODE=model_cpu_offload
      - GPU_IDLE_TIMEOUT=300
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

# Start the service
docker-compose up -d

📦 Installation

Prerequisites

  • NVIDIA GPU with 18GB+ VRAM (RTX 4090, L40S, A100, etc.)
  • Docker with NVIDIA Container Toolkit
  • ~30GB disk space for model checkpoints
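Before pulling ~30GB of checkpoints, it is worth confirming that Docker can see the GPU at all. A quick check using a stock CUDA image (the tag below is only an example; any recent nvidia/cuda base image works):

# Verify the NVIDIA Container Toolkit is wired up correctly
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If this prints the usual nvidia-smi table, the StableAvatar container will be able to use the GPU as well.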

Download Model Checkpoints

# Install huggingface-cli
pip install "huggingface_hub[cli]"

# Download all checkpoints
mkdir -p checkpoints
huggingface-cli download FrancisRing/StableAvatar --local-dir ./checkpoints

Model Files Structure

checkpoints/
├── Kim_Vocal_2.onnx              # Vocal separation model
├── wav2vec2-base-960h/           # Audio encoder
├── Wan2.1-Fun-V1.1-1.3B-InP/     # Base diffusion model
└── StableAvatar-1.3B/            # StableAvatar weights
    ├── transformer3d-square.pt   # For 512x512 output
    └── transformer3d-rec-vec.pt  # For 480x832/832x480 output
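If you only plan to render one output shape, you may be able to skip the other transformer checkpoint. This is a sketch, assuming huggingface-cli's --exclude filter and the file layout shown above; adjust the pattern if the repository layout differs:

# Example: keep only the 512x512 weights, skip the 480x832/832x480 checkpoint
huggingface-cli download FrancisRing/StableAvatar \
  --local-dir ./checkpoints \
  --exclude "StableAvatar-1.3B/transformer3d-rec-vec.pt"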

⚙️ Configuration

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| PORT | 8400 | Service port |
| GPU_MEMORY_MODE | model_cpu_offload | Memory optimization mode |
| GPU_IDLE_TIMEOUT | 300 | Auto-release VRAM after idle (seconds) |
| NVIDIA_VISIBLE_DEVICES | all | GPU device selection |
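When running without Compose, the same variables can be passed with -e flags on docker run. For example, to trade speed for a smaller VRAM footprint (the values here are illustrative; see the mode table below):

docker run -d \
  --name stableavatar \
  --gpus all \
  -p 8400:8400 \
  -e GPU_MEMORY_MODE=sequential_cpu_offload \
  -e GPU_IDLE_TIMEOUT=600 \
  -v ./checkpoints:/app/checkpoints:ro \
  -v ./outputs:/app/outputs \
  neosun/stableavatar:latest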

GPU Memory Modes

| Mode | VRAM Usage | Speed | Description |
|------|------------|-------|-------------|
| model_full_load | ~18GB | Fastest | Load all models to GPU |
| model_cpu_offload | ~10GB | Fast | Offload to CPU when idle |
| sequential_cpu_offload | ~3GB | Slow | Minimal VRAM usage |
| model_cpu_offload_and_qfloat8 | ~8GB | Medium | FP8 quantization |
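If you are unsure which mode you need, watch VRAM usage on the host while a test job runs and pick the smallest mode that fits comfortably:

# Sample GPU memory usage every 2 seconds during a generation run
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2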

📖 Usage

Web UI

Access http://localhost:8400 for the web interface:

  1. Upload a reference image (face photo)
  2. Upload driving audio (speech/singing)
  3. Set parameters (resolution, steps, etc.)
  4. Click "Generate Video"

REST API

# Generate video
curl -X POST "http://localhost:8400/api/generate" \
  -F "image=@reference.jpg" \
  -F "audio=@speech.wav" \
  -F "prompt=A person is talking" \
  -F "width=512" \
  -F "height=512" \
  -F "num_inference_steps=30"

# Check status
curl "http://localhost:8400/api/status/{task_id}"

# Download result
curl -O "http://localhost:8400/api/result/{task_id}"
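Generation is asynchronous, so scripts usually poll the status endpoint and then fetch the result. The sketch below assumes the generate response is JSON with a task_id field and that a finished task reports a terminal state such as completed or failed; those names are assumptions, so check the Swagger docs for the actual schema.

# Submit a job and extract the task id (the field name "task_id" is an assumption)
RESPONSE=$(curl -s -X POST "http://localhost:8400/api/generate" \
  -F "image=@reference.jpg" \
  -F "audio=@speech.wav" \
  -F "width=512" -F "height=512")
TASK_ID=$(echo "$RESPONSE" | python3 -c "import sys, json; print(json.load(sys.stdin)['task_id'])")

# Poll until the task reaches a terminal state, then download the video
while true; do
  STATUS=$(curl -s "http://localhost:8400/api/status/${TASK_ID}")
  echo "$STATUS"
  echo "$STATUS" | grep -Eq '"(completed|failed)"' && break
  sleep 15
done
curl -o output.mp4 "http://localhost:8400/api/result/${TASK_ID}"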

API Documentation

Swagger UI available at: http://localhost:8400/apidocs/

🔧 API Reference

POST /api/generate

Generate avatar video from image and audio.

Parameters:

| Name | Type | Default | Description |
|------|------|---------|-------------|
| image | file | required | Reference face image |
| audio | file | required | Driving audio file |
| prompt | string | "" | Text prompt for generation |
| width | int | 512 | Output width (512/480/832) |
| height | int | 512 | Output height (512/832/480) |
| num_inference_steps | int | 50 | Denoising steps (30-50) |
| seed | int | -1 | Random seed (-1 for random) |
| text_guide_scale | float | 3.0 | Text guidance scale |
| audio_guide_scale | float | 5.0 | Audio guidance scale |
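For a reproducible portrait-orientation render, the documented parameters can be combined like this (only the input file names are placeholders):

curl -X POST "http://localhost:8400/api/generate" \
  -F "image=@reference.jpg" \
  -F "audio=@speech.wav" \
  -F "prompt=A person is talking" \
  -F "width=480" \
  -F "height=832" \
  -F "num_inference_steps=50" \
  -F "seed=42"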

GET /api/status/{task_id}

Get task status and progress.

GET /api/result/{task_id}

Download generated video.

🛠️ Tech Stack

  • Model: Wan2.1-1.3B DiT + StableAvatar Audio Adapter
  • Framework: PyTorch 2.6, Diffusers
  • Audio: Wav2Vec2, Librosa
  • Backend: Flask, Gunicorn
  • Frontend: Vanilla JS, TailwindCSS

📊 Performance

| Resolution | Duration | VRAM | Time (RTX 4090) |
|------------|----------|------|-----------------|
| 512x512 | 5s | ~18GB | ~3 min |
| 512x512 | 20s | ~20GB | ~10 min |
| 480x832 | 5s | ~20GB | ~4 min |

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 Changelog

v1.0.0 (2026-01-04)

  • Initial release with Docker support
  • Web UI with multi-language support
  • REST API with Swagger documentation
  • MCP server integration

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

