Overview
A production AI avatar system that generates real-time, lip-synced video of virtual humans speaking LLM-generated text. The system combines large language models for conversation, text-to-speech for audio generation, and the MuseTalk neural lip-sync model to produce video streams delivered over WebRTC with 2–5 seconds of end-to-end latency.
The Challenge
Creating realistic, interactive AI avatars that respond in real time is a multi-server orchestration problem. You need an LLM to generate the response, a TTS engine to produce audio, a lip-sync model to animate the face, and a streaming pipeline to deliver the video — all fast enough to feel like a live conversation.
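The chain described above can be sketched as one streaming pipeline. The sketch below is illustrative only: `stream_tokens`, `synthesize`, and `lip_sync` are hypothetical stand-ins for the real LLM, TTS, and MuseTalk calls, but the flow — buffer tokens, flush audio on a clause boundary, yield frames as soon as each audio chunk is ready — mirrors what makes the conversation feel live.

```python
# Hypothetical sketch of the LLM -> TTS -> lip-sync chain; all three
# stage functions are illustrative placeholders, not the real APIs.

def stream_tokens(prompt):
    # Stand-in for a streaming LLM API: yields text chunks as they arrive.
    for word in ("Hello,", "how", "can", "I", "help?"):
        yield word

def synthesize(sentence):
    # Stand-in for TTS: returns one chunk of PCM audio for a sentence.
    return b"\x00" * 16000  # 1 s of 8-bit silence at 16 kHz

def lip_sync(audio_chunk):
    # Stand-in for MuseTalk inference: yields video frames for the audio.
    n_frames = len(audio_chunk) * 25 // 16000  # 25 fps output
    for _ in range(n_frames):
        yield "frame"

def respond(prompt):
    """Pipe streaming tokens through TTS and lip-sync, frame by frame."""
    buffer = []
    for token in stream_tokens(prompt):
        buffer.append(token)
        if token.endswith((".", "?", "!", ",")):  # flush on clause boundary
            audio = synthesize(" ".join(buffer))
            buffer.clear()
            yield from lip_sync(audio)
    if buffer:  # flush any trailing text
        yield from lip_sync(synthesize(" ".join(buffer)))

frames = list(respond("hi"))
```

The key design point is that no stage waits for the previous one to finish the whole response: frames start streaming after the first clause is synthesized, which is what keeps latency in the low single-digit seconds.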
What I Built
- Multi-server architecture with dedicated GPU servers for lip-sync inference and a control server for orchestration
- Avatar preparation pipeline that pre-computes facial latent encodings and face masks for fast runtime rendering
- WebRTC streaming for low-latency video delivery directly to the browser
- SocketIO signaling for real-time frame delivery and connection management
- GPU health monitoring with automatic failover between available inference servers
- LLM integration with streaming token generation piped through TTS and lip-sync in a continuous pipeline
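The avatar preparation bullet above amounts to moving all input-independent work to prep time. A minimal sketch, with stub encoder and face-parsing functions standing in for the real MuseTalk components:

```python
def encode_face(frame):
    # Stand-in for the MuseTalk VAE encoder; returns a dummy latent vector.
    return [0.0] * 64

def lower_face_mask(frame):
    # Stand-in for face parsing; returns a dummy binary mouth-region mask.
    return [[1] * 16 for _ in range(16)]

def prepare_avatar(reference_frames):
    """One-time preparation: cache latents and masks so runtime rendering
    only has to run the audio-conditioned decoder per frame."""
    return {
        "latents": [encode_face(f) for f in reference_frames],
        "masks": [lower_face_mask(f) for f in reference_frames],
    }

avatar = prepare_avatar(["frame0", "frame1", "frame2"])
```

At runtime the renderer indexes into these cached arrays instead of re-encoding the reference video, which is what makes per-frame inference fast enough for streaming.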
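The GPU health-monitoring bullet reduces to a small routing pattern: a background loop probes each inference server, and request dispatch always picks the first healthy one. The class and server names below are assumptions for illustration, not the production code:

```python
class GPUPool:
    """Route inference requests to the first healthy GPU server.

    Hypothetical sketch: the server address format and the mark/pick
    interface are illustrative, not the production implementation.
    """

    def __init__(self, servers):
        self.servers = list(servers)  # e.g. ["gpu-1:8000", "gpu-2:8000"]
        self.healthy = dict.fromkeys(self.servers, True)

    def mark(self, server, ok):
        # Called by a background health-check loop after each probe.
        self.healthy[server] = ok

    def pick(self):
        # Automatic failover: skip servers currently marked unhealthy.
        for server in self.servers:
            if self.healthy[server]:
                return server
        raise RuntimeError("no healthy GPU inference servers")

pool = GPUPool(["gpu-1:8000", "gpu-2:8000"])
pool.mark("gpu-1:8000", False)  # probe failed -> fail over to gpu-2
```

In production the `mark` calls would come from periodic HTTP probes; the point of the pattern is that dispatch code calling `pick` never needs to know why a server dropped out.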
Tech Stack
Python, Flask, PyTorch, MuseTalk v1.5, WebRTC, SocketIO, DeepInfra API (Llama, TTS), PostgreSQL, CUDA GPU inference
