IndexTTS2: Smart Speech Synthesis, Precise Emotion and Duration Control
Core Features & Advantages
IndexTTS2 is a zero-shot Text-to-Speech (TTS) model
designed to overcome the duration control challenges of traditional
autoregressive systems. It is presented as the first autoregressive TTS
model to achieve precise control over synthesized speech duration, making
it especially suitable for applications like video dubbing that demand
strict synchronization.
This model innovatively disentangles emotional expression from speaker identity,
allowing users to independently control timbre and emotion. Whether reproducing
the native emotion from a prompt audio or customizing emotions via an independent
emotional prompt (which can originate from a different speaker), IndexTTS2
delivers accurate results.
To enhance speech clarity during strong emotional expressions, IndexTTS2
integrates GPT latent representations. It also introduces a soft instruction
mechanism based on textual descriptions, built by fine-tuning Qwen3, which
significantly simplifies emotion control: users can steer the emotional
tendency of the output with natural language input.
Experimental results demonstrate that IndexTTS2 surpasses existing state-of-the-art
zero-shot TTS models in word error rate, speaker similarity, and emotional
fidelity. We will release the model weights and inference code to foster
further research and industry adoption.
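For readers unfamiliar with the first metric above, here is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The function name `word_error_rate` is illustrative, and this page does not state which recognizer or text normalization the evaluation used, so treat this as the textbook definition rather than the authors' evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of seven reference words.
wer = word_error_rate(
    "there is no wine in this country",
    "there is no wine in the country",
)
```

For Chinese text, the same ratio is commonly computed over characters instead of whitespace-separated words.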
Feature Demos
1. Controllability: Zero-Shot TTS with Adjustable Speech Duration
IndexTTS2 achieves, for the first time, precise control over speech duration in autoregressive TTS models. This section demonstrates the model's capability in duration-controlled synthesis, where three different target durations are specified for each test case. Due to the lack of existing autoregressive TTS models with comparable duration control capabilities, we adopt non-autoregressive models as the baselines in this part.
| Audio Prompt | Text | Ground Truth | Model | Audio (duration 0.75x) | Audio (duration 1.0x) | Audio (duration 1.25x) |
|---|---|---|---|---|---|---|
| | The equipment needed to do this includes rock saws and polishers. | | IndexTTS2 | | | |
| | | | MaskGCT | | | |
| | | | F5-TTS | | | |
| | There is no wine in this country, the young man said. | | IndexTTS2 | | | |
| | | | MaskGCT | | | |
| | | | F5-TTS | | | |
| | 只有当科技为本地社群创造价值的时候,才真正有意义。 | | IndexTTS2 | | | |
| | | | MaskGCT | | | |
| | | | F5-TTS | | | |
| | 类推可用于颠覆惯性思维,以便为新的创意开路。 | | IndexTTS2 | | | |
| | | | MaskGCT | | | |
| | | | F5-TTS | | | |
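The three duration columns above correspond to scaling a reference duration by 0.75, 1.0, and 1.25. A minimal sketch of that arithmetic follows, together with a relative error measure one might use to check how closely a synthesized clip hits its target. The helper names are hypothetical, and this page does not say how the target duration is passed to the model or which clip serves as the reference, so the 4-second reference below is only an example value.

```python
from typing import Dict, Sequence


def target_durations(
    reference_seconds: float,
    scales: Sequence[float] = (0.75, 1.0, 1.25),
) -> Dict[float, float]:
    """Map each scale factor to the target duration in seconds."""
    if reference_seconds <= 0:
        raise ValueError("reference_seconds must be positive")
    return {scale: reference_seconds * scale for scale in scales}


def relative_duration_error(actual_seconds: float, target_seconds: float) -> float:
    """Absolute deviation from the target, as a fraction of the target."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return abs(actual_seconds - target_seconds) / target_seconds


# Example: a 4-second reference clip (hypothetical value).
targets = target_durations(4.0)
```

A lower relative error at each scale indicates tighter duration control, which is the property that matters for synchronization-sensitive uses such as dubbing.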
2. Controllability: Zero-Shot TTS with Emotionally Expressive Speech
IndexTTS2 is capable of accurately reconstructing the emotional content present in the prompt audio. Thanks to the model’s effective disentanglement of emotional attributes from speaker-related features, users can explicitly control the target emotion by providing an additional emotional audio prompt, enabling the synthesis of speech with specified emotional expressions. Moreover, our framework is well integrated with a natural language-driven emotional modulation mechanism, enabling precise and semantically meaningful emotion customization.
2.1 Using Identical Prompt
We use the same audio prompt across all input conditions, so the reference for both timbre and emotion is the exact same audio. With a single prompt, any variation in the generated speech comes purely from the target emotion or timbre, and none is introduced by emotional prompts taken from other speakers.
| Emotion | Audio Prompt | Text | IndexTTS2 | IndexTTS | MaskGCT | CosyVoice2 | SparkTTS | F5-TTS | IndexTTS2-wog | IndexTTS2-wos2m |
|---|---|---|---|---|---|---|---|---|---|---|
| Angry | | 你在我们屋子里走路的时候,发现路程遥远,这是不足为怪的。 | | | | | | | | |
| | | 似乎科琳完成的这身午夜蓝套,裙与旧时代的职业女性并无分别。 | | | | | | | | |
| Cry | | 共同建设面向未来的交通,和出行服务新生态 | | | | | | | | |
| | | 汤姆,我真愿意信你的话,这样可以一肥遮百丑。 | | | | | | | | |
| Fear | | 但到投票前日,内菲斯竟以黑马之姿冲过席尔瓦,日渐下降的支持率。 | | | | | | | | |
| | | 过了一会一切都结束了,这座山在月光下显得幽静而静谧。 | | | | | | | | |
| Depressed | | 基本上隔一天,小如便会因为不听话而挨揍。 | | | | | | | | |
| | | 狗狗阿黄同志,当森林学校的门卫有五年啦,工作尽职尽责。 | | | | | | | | |
| Happy | | 更傻眼的是过了没多久,银行就开始催款了。 | | | | | | | | |
| | | 其中一只正又两条前肢,抓住一只有自己身体五倍大的死蜘蛛。 | | | | | | | | |
| Surprise | | 他希望能看到灯笼闪一下光,这虽然让他害怕。 | | | | | | | | |
| | | 比如有的业主,贪便宜找马路上的游击队来装修。 | | | | | | | | |
| Calm | | 攀爬上官场高位后,开始给家里的各种亲戚安排工作。 | | | | | | | | |
| | | 近日,除了葛洲坝股价下跌外,其余三家均有不同程度的上涨。 | | | | | | | | |
2.2 Using Different Prompts
We use separate audio prompts as references for timbre and for emotional expression, so that speaker-related timbral characteristics and emotion-related prosodic and intonational features come from different audio sources. This ablation study allows a more effective validation of the emotion modulation mechanism, eliminating confounding effects caused by variations in timbre.
| Timbre Audio Prompt | Emotion Audio Prompt | Text | Emotion Weight: 0 | Emotion Weight: 0.6 | Emotion Weight: 1.0 | Emotion Weight: 1.4 |
|---|---|---|---|---|---|---|
| | | 这一天,天上的乌云又多又厚又沉,整个森林暗得就像黑夜一样。 | | | | |
| | | 这他妈就是你给的解决方案?老子连续加班三个月,就换来一沓废纸!现在、立刻、马上给我滚出! | | | | |
| | | 我站在人海中,却感觉比任何时候都要孤独。 | | | | |
| | | 尾号四四九幺的乘客刚夸了你,厉害了我的师傅,你真是个活地图。 | | | | |
| | | 有些人走了就再也没有回来过,所以等待和犹豫是这个世界上最无情的杀手。 | | | | |
| | | 做一个温暖的人,将岁月里的凝重、安暖,写意成简单,将过往的风景,安放在清浅的时光中。 | | | | |
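This page does not describe how the emotion weight is applied inside the model. One plausible reading, offered purely as an assumption, is a linear blend between an emotion vector derived from the timbre prompt and one derived from the emotion prompt: weight 0 keeps the timbre prompt's own emotion, 1.0 adopts the emotion prompt's, and values above 1.0 (such as 1.4) extrapolate past it. The function name and the toy two-dimensional vectors below are hypothetical.

```python
from typing import List, Sequence


def blend_emotion(
    base: Sequence[float],
    emotion: Sequence[float],
    weight: float,
) -> List[float]:
    """Linearly interpolate (or extrapolate, for weight > 1) between two vectors.

    Assumed formulation, not confirmed by the source:
        blended = base + weight * (emotion - base)
    """
    if len(base) != len(emotion):
        raise ValueError("vectors must have the same length")
    return [b + weight * (e - b) for b, e in zip(base, emotion)]


# Toy vectors standing in for real emotion embeddings (hypothetical values).
timbre_prompt_emotion = [0.0, 2.0]
emotion_prompt_emotion = [1.0, 0.0]

# The four weights shown in the table above.
blends = {
    w: blend_emotion(timbre_prompt_emotion, emotion_prompt_emotion, w)
    for w in (0.0, 0.6, 1.0, 1.4)
}
```

Under this reading, the weight acts as a single intensity dial, which would match the gradual change one would expect to hear across the four columns.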
2.3 Using Textual Description
In this alternative approach, we replace the emotional audio reference with a descriptive textual prompt and keep the timbre reference audio unchanged. The model therefore conditions emotional expression solely on text-based emotion cues rather than on audio signals. This setup tests how well the model generates emotionally expressive speech from linguistic descriptions alone, without relying on emotion audio prompts.
Timbre Audio Prompt | Emotion Description | Text | Audio |
---|---|---|---|
I feel really down | 这究竟是我的福,还是我的孽?岂止是皇上错了,我更是错了!这几年的情爱与时光,究竟是错付了! | ||
有点快乐,哈哈 | |||
巨巨巨巨巨巨巨巨难过 | |||
I feel really down | Was this my blessing, or my curse? It’s not just the Emperor who was wrong — I was even more mistaken! All these years of love and devotion… in the end, were they nothing but a wasted heart? | ||
有点快乐,哈哈 | |||
巨巨巨巨巨巨巨巨难过 | |||
书桓走的第一天,想他,想他,想他。 | 书桓走的第一天,想他,想他,想他。 | ||
On the first day that Shuhan left, all I did was miss him. Miss him. Miss him. | |||
书桓走的第二天,想他,想他,想他。 | 书桓走的第二天,想他,想他,想他。 | ||
The second day Shuhan is gone, and still — I miss him. I miss him. I miss him. | |||
书桓走了第三天了,想他,想他,想他,发疯一样的想他。 | 书桓走了第三天了,想他,想他,想他,发疯一样的想他。 | ||
The third day Shuhan has been gone… and I still miss him. Miss him. Miss him. I miss him like I've lost my mind. | |||
超级无敌爆炸angry的情感,就像刚中了彩票被人偷拿了 | 你问他为什么我没谈恋爱,我就失恋了,你问他,为什么这么对我,我以为我会问,可是我见到他之后,我就不想问了,因为人家根本就不想说,人家甚至都不想见到你,我为什么在那儿犯贱呢,所以我不是放过他,我是想放过我自己,人家不联系你怎么了,不回你微信怎么了,伤害你又怎么了,你算他谁啊? | ||
又生气又委屈 | 我为什么非得知道发生了什么呢,我不就是想给自己找一个原因嘛,我只是想找一个原因,我原谅他,可是我为什么非得原谅他呢,我干嘛把自己搞得这么卑微啊? | ||
我们正在做一些神奇的事情,给我来一种又fear,但是又有点开心的情感。 | 这游戏太刺激了,心跳都快停了...但我们又能感受到那种挑战未知的兴奋和快乐。说真的,我现在是又害怕又期待,紧张得手心都在冒汗! |
3. Zero-Shot TTS Comparison With Other Open-Source Models
3.1 Monolingual Speech Synthesis
Audio Prompt | Text | Ground Truth | Model | Audio (duration 0.75x) | Audio (duration 1.0x) | Audio (duration 1.25x) | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
家居养娃的李娜又重新出现在媒体大众的面前 | IndexTTS2 | ||||||||||
These are two of only three known formations to have dinosaur fossils in Antarctica. | IndexTTS2 | ||||||||||
The man looked at him without responding. | IndexTTS2 | ||||||||||
胡萝卜凉拌或炒鸡蛋味道都是棒极的,胡萝卜骄傲地说。 | IndexTTS2 | ||||||||||
rodolfo arrived at his own house without any impediment and leocadia's parents reached theirs heart broken and despairing | IndexTTS2 | ||||||||||
那些袖珍衣服挂在架子上,远远看上去就像一幅画,可漂亮了。 | IndexTTS2 |