Dual-System Avatar Technology
OmniHuman-1.5 is a groundbreaking virtual human generation framework that creates expressive character animations from a single image and an audio track. Inspired by the dual-process ("System 1 and System 2") theory of human cognition, it pairs a Multimodal Large Language Model, which handles slow, deliberate planning, with a Diffusion Transformer, which handles fast, intuitive reaction. This combination enables the generation of videos with highly dynamic motion, continuous camera movement, and complex multi-character interactions, all in precise sync with the audio.
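To make the division of labor concrete, here is a minimal Python sketch of that two-stage pipeline. Every name in it (MotionPlan, MotionPlanner, VideoDiffusionTransformer, animate) is a hypothetical stand-in for illustration, not the released OmniHuman-1.5 API.

```python
# Hedged sketch of the dual-system pipeline described above.
# All class and function names here are hypothetical placeholders.
from dataclasses import dataclass, field


@dataclass
class MotionPlan:
    """High-level plan produced by the slow, deliberate "System 2" stage."""
    emotion: str                                   # e.g. "joyful", "somber"
    gestures: list[str] = field(default_factory=list)  # coarse gesture schedule
    camera: str = "static"                         # coarse camera directive


class MotionPlanner:
    """"System 2": a multimodal LLM that reasons over the image and audio."""
    def plan(self, image_path: str, audio_path: str) -> MotionPlan:
        # Placeholder: the real system would run MLLM inference here.
        return MotionPlan(emotion="joyful",
                          gestures=["touch collar", "spread arms"],
                          camera="follow subject")


class VideoDiffusionTransformer:
    """"System 1": a DiT that reacts to the audio frame by frame."""
    def generate(self, image_path: str, audio_path: str,
                 plan: MotionPlan) -> bytes:
        # Placeholder: a real sampler would synthesize frames in sync with
        # the waveform while the plan steers emotion, gesture, and camera.
        return b""


def animate(image_path: str, audio_path: str) -> bytes:
    plan = MotionPlanner().plan(image_path, audio_path)   # slow, deliberate planning
    dit = VideoDiffusionTransformer()
    return dit.generate(image_path, audio_path, plan)     # fast, intuitive reaction
```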
Rhythm and Performance
The framework's versatility lets it handle musical scenarios with ease. From just a single image and a song, it can create an emotionally rich digital singer. The model captures the full range of musical expression, from precise lip-sync through natural pauses and rhythmic shifts, adapting flexibly to styles from solo ballads to fast-paced concerts.
Cinematic Emotional Performances
From just a single image and an audio track, the framework brings digital actors to life on screen. By analyzing the emotional subtext of the audio alone, with no text prompt, it generates captivating, cinematic performances with full dramatic range, from explosive anger to heartfelt confession.
Context-Aware Audio-Driven Animation
By interpreting the semantic context of the audio, the model goes beyond simple lip-sync and repetitive motions, allowing characters to exhibit genuine emotional shifts and match their gestures to their words, as if driven by a will of their own.
Text-Guided Multimodal Animation
This framework accepts text prompts and demonstrates exceptional prompt-following, enabling precise control over object generation, camera movement, and specific actions, all while maintaining precise audio sync. The example prompts below illustrate this control; a hedged usage sketch follows them.
Prompt: The camera follows as the man turns to face it and walks forward, singing in ecstasy. At times he touches his collar with both hands; at others he spreads his arms and lifts his head, lost in rapture.
Prompt: The camera zooms in rapidly to a close-up of the woman’s shoes, then slowly pans up to her face. The beautiful girl sways her body charmingly.
Prompt: The man sang intoxicatedly. He first glanced out the window, then placed his left hand on his chest as if in rapture. Next, he stood up and walked forward along the train aisle, once again placing his left hand on his chest.
Prompt: Handheld camera. A woman looks into the distance. In the background, there are fireworks. The wind is blowing her hair and clothes. It has the feel of an arthouse film, a lonely atmosphere, and is shot on film.
Prompt: Circle the camera to the right. When the camera focuses on the man's face, hold it still for a low, somber mood.
Prompt: The character's face moves forward, they look at the camera, then reach out and poke the camera lens. After that, the camera moves backward, and the character crosses their arms and starts to talk.
Prompt: Man takes cigarette out, looks to camera, speaks.
Prompt: A penguin is dancing. A pair of hands puts a cool pair of sunglasses on it. A band is playing, and the audience is cheering.
Prompt: A chick wearing sunglasses, holding two guns, talking, with an evil vibe.
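As a rough illustration of how such a prompt might be supplied alongside the image and audio conditions, here is a hedged sketch. The generate_video function and its parameters are hypothetical, not the project's published interface.

```python
def generate_video(image_path: str, audio_path: str, prompt: str = "") -> bytes:
    """Hypothetical entry point: an empty prompt falls back to audio-only driving."""
    return b""  # placeholder for the actual inference call


# Reusing one of the example prompts above as the text condition.
clip = generate_video(
    image_path="woman.png",
    audio_path="song.wav",
    prompt="The camera zooms in rapidly to a close-up of the woman's shoes, "
           "then slowly pans up to her face. The beautiful girl sways her "
           "body charmingly.",
)
```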
Multi-Person Scene Performance
The framework also extends to complex multi-person scenes. It can generate dynamic group dialogues and ensemble performances by routing separate audio tracks to the correct characters within a single frame.
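One plausible way to express that routing is sketched below, with a hypothetical CharacterTrack record and route_tracks helper (neither comes from the released system): each character region in the frame is bound to exactly one audio track, so lips and gestures stay separate.

```python
# Hedged sketch of per-character audio routing in a multi-person scene.
from dataclasses import dataclass


@dataclass
class CharacterTrack:
    name: str
    region: tuple[int, int, int, int]  # (x, y, w, h) box locating the character
    audio_path: str                    # the track that should drive this character


def route_tracks(tracks: list[CharacterTrack]) -> dict[str, str]:
    """Bind each character to its own audio track for independent driving."""
    return {t.name: t.audio_path for t in tracks}


duet = [
    CharacterTrack("singer_left",  (40, 120, 300, 600),  "vocals_left.wav"),
    CharacterTrack("singer_right", (420, 110, 300, 600), "vocals_right.wav"),
]
routing = route_tracks(duet)  # {"singer_left": "vocals_left.wav", ...}
```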
Diversity and Robustness
The model demonstrates its robustness by generating high-quality, audio-synchronized video across a wide diversity of subjects, including real animals, anthropomorphic characters, and stylized cartoons.