IndexTTS2: Smart Speech Synthesis, Precise Emotion and Duration Control

Core Features & Advantages

IndexTTS2 is a zero-shot Text-to-Speech (TTS) model designed to overcome the duration control challenges of traditional autoregressive systems. It is the first autoregressive TTS model to achieve precise control over the duration of synthesized speech, making it especially suitable for applications such as video dubbing that demand strict audio-visual synchronization.

This model disentangles emotional expression from speaker identity, allowing users to control timbre and emotion independently. It can either reproduce the emotion already present in the prompt audio or follow a separate emotional prompt, which may come from a different speaker. In both cases, IndexTTS2 reproduces the target emotion accurately.

To keep speech clear during strongly emotional passages, IndexTTS2 integrates GPT latent representations. It also introduces a soft instruction mechanism based on textual descriptions, built by fine-tuning Qwen3, which significantly simplifies emotion control: users can steer the desired emotional tendency with natural language input.

Experimental results demonstrate that IndexTTS2 surpasses existing state-of-the-art zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity. We will release the model weights and inference code to foster further research and industry adoption.

Feature Demos

IndexTTS2: Controllable Emotional Speech Generation for Audiovisual Dubbing – A Case Study on Iconic Scenes from Let the Bullets Fly

IndexTTS2: Controllable Emotional Speech Generation for Audiovisual Dubbing – A Case Study on Iconic Scenes from Empresses in the Palace (Part 1)

IndexTTS2: Controllable Emotional Speech Generation for Audiovisual Dubbing – A Case Study on Iconic Scenes from Empresses in the Palace (Part 2)

1. Controllability: Zero-Shot TTS with Adjustable Speech Duration
2. Controllability: Zero-Shot TTS with Emotionally Expressive Speech
- 2.1 Using Identical Prompt
- 2.2 Using Different Prompts
- 2.3 Using Textual Description
3. Zero-Shot TTS Comparison With Other Open-Source Models
- 3.1 Monolingual Speech Synthesis

1. Controllability: Zero-Shot TTS with Adjustable Speech Duration

IndexTTS2 achieves, for the first time, precise control over speech duration in autoregressive TTS models. This section demonstrates the model's capability in duration-controlled synthesis: each test case is synthesized at three target durations (0.75x, 1.0x, and 1.25x). Because no existing autoregressive TTS model offers comparable duration control, we use non-autoregressive models as the baselines in this part.

Audio Prompt	Text	Ground Truth	Model	Audio (duration 0.75x)	Audio (duration 1.0x)	Audio (duration 1.25x)
	The equipment needed to do this includes rock saws and polishers.		IndexTTS2
			MaskGCT
			F5-TTS
	There is no wine in this country, the young man said.		IndexTTS2
			MaskGCT
			F5-TTS
	只有当科技为本地社群创造价值的时候，才真正有意义。		IndexTTS2
			MaskGCT
			F5-TTS
	类推可用于颠覆惯性思维，以便为新的创意开路。		IndexTTS2
			MaskGCT
			F5-TTS
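
As a rough illustration of what the three duration settings in the table above mean in practice, the following Python sketch scales a reference duration by 0.75x, 1.0x, and 1.25x and converts each target into a token budget. The function name `duration_targets`, the `DurationTarget` type, and the rate of 25 tokens per second are assumptions made for this example only; this page does not state how IndexTTS2 represents duration internally.

```python
from dataclasses import dataclass

# Duration scales used in the demo table.
DURATION_SCALES = (0.75, 1.0, 1.25)

# Hypothetical token rate (tokens per second). The real value depends on
# the model's speech tokenizer and is NOT given in the source.
ASSUMED_TOKENS_PER_SECOND = 25.0


@dataclass(frozen=True)
class DurationTarget:
    scale: float    # requested multiple of the reference duration
    seconds: float  # target duration in seconds
    tokens: int     # target duration as a token budget (assumed rate)


def duration_targets(reference_seconds,
                     scales=DURATION_SCALES,
                     tokens_per_second=ASSUMED_TOKENS_PER_SECOND):
    """Return one DurationTarget per scale for a reference utterance."""
    if reference_seconds <= 0:
        raise ValueError("reference_seconds must be positive")
    targets = []
    for scale in scales:
        seconds = reference_seconds * scale
        # Round to a whole number of tokens, never fewer than one.
        tokens = max(1, round(seconds * tokens_per_second))
        targets.append(DurationTarget(scale, seconds, tokens))
    return targets


if __name__ == "__main__":
    for target in duration_targets(4.0):
        print(target)
```

For a 4-second reference utterance, the targets are 3, 4, and 5 seconds; the token budgets follow from whatever token rate is assumed.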

2. Controllability: Zero-Shot TTS with Emotionally Expressive Speech

IndexTTS2 is capable of accurately reconstructing the emotional content present in the prompt audio. Thanks to the model’s effective disentanglement of emotional attributes from speaker-related features, users can explicitly control the target emotion by providing an additional emotional audio prompt, enabling the synthesis of speech with specified emotional expressions. Moreover, our framework is well integrated with a natural language-driven emotional modulation mechanism, enabling precise and semantically meaningful emotion customization.
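
The three conditioning setups demonstrated in Sections 2.1 to 2.3 can be summarized with a small data structure. The sketch below is purely illustrative: `EmotionControl`, its field names, and the mode labels are hypothetical and are not taken from the IndexTTS2 inference code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmotionControl:
    """Hypothetical description of how one synthesis request is conditioned."""
    timbre_prompt: str                    # path to the speaker (timbre) reference audio
    emotion_prompt: Optional[str] = None  # path to a separate emotion reference audio
    emotion_text: Optional[str] = None    # natural language emotion description

    def __post_init__(self):
        # An emotion reference is either audio or text, never both.
        if self.emotion_prompt is not None and self.emotion_text is not None:
            raise ValueError(
                "choose either an emotion audio prompt or a text description"
            )

    @property
    def mode(self) -> str:
        if self.emotion_text is not None:
            return "text_description"    # Section 2.3
        if (self.emotion_prompt is not None
                and self.emotion_prompt != self.timbre_prompt):
            return "different_prompts"   # Section 2.2
        return "identical_prompt"        # Section 2.1


if __name__ == "__main__":
    print(EmotionControl("speaker.wav").mode)
    print(EmotionControl("speaker.wav", emotion_prompt="angry.wav").mode)
    print(EmotionControl("speaker.wav", emotion_text="I feel really down").mode)
```

Making the audio and text emotion references mutually exclusive mirrors the way the demos are organized, where each sample uses exactly one emotion source.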

2.1 Using Identical Prompt

We use the same audio prompt across all input conditions, so the reference for both timbre and emotion is the exact same audio content. This design ensures that the emotion and timbre of the generated speech can be traced to a single source, and it removes the confound of pairing a timbre prompt with an emotional prompt from a different speaker.

Emotion	Audio Prompt	Text	IndexTTS2	Other Model 1	Other Model 2	Other Model 3	Other Model 4	Other Model 5	Other Model 6	Other Model 7
Angry		你在我们屋子里走路的时候，发现路程遥远，这是不足为怪的。
Angry		似乎科琳完成的这身午夜蓝套，裙与旧时代的职业女性并无分别。
Cry		共同建设面向未来的交通，和出行服务新生态
Cry		汤姆，我真愿意信你的话，这样可以一肥遮百丑。
Fear		但到投票前日，内菲斯竟以黑马之姿冲过席尔瓦，日渐下降的支持率。
Fear		过了一会一切都结束了，这座山在月光下显得幽静而静谧。
Depressed		基本上隔一天，小如便会因为不听话而挨揍。
Depressed		狗狗阿黄同志，当森林学校的门卫有五年啦，工作尽职尽责。
Happy		更傻眼的是过了没多久，银行就开始催款了。
Happy		其中一只正又两条前肢，抓住一只有自己身体五倍大的死蜘蛛。
Surprise		他希望能看到灯笼闪一下光，这虽然让他害怕。
Surprise		比如有的业主，贪便宜找马路上的游击队来装修。
Calm		攀爬上官场高位后，开始给家里的各种亲戚安排工作。
Calm		近日，除了葛洲坝股价下跌外，其余三家均有不同程度的上涨。

2.2 Using Different Prompts

We use separate audio prompts for timbre and for emotional expression, so that speaker-related timbral characteristics and emotion-related prosodic and intonational features come from different audio sources. Keeping the timbre prompt fixed while varying the emotion weight validates the emotion modulation mechanism more directly, because it rules out confounding effects caused by variations in timbre.

Timbre Audio Prompt	Emotion Audio Prompt	Text	Emotion Weight: 0	Emotion Weight: 0.6	Emotion Weight: 1.0	Emotion Weight: 1.4
		这一天，天上的乌云又多又厚又沉，整个森林暗得就像黑夜一样。
		这他妈就是你给的解决方案？老子连续加班三个月，就换来一沓废纸！现在、立刻、马上给我滚出！
		我站在人海中，却感觉比任何时候都要孤独。
		尾号四四九幺的乘客刚夸了你，厉害了我的师傅，你真是个活地图。
		有些人走了就再也没有回来过，所以等待和犹豫是这个世界上最无情的杀手。
		做一个温暖的人，将岁月里的凝重、安暖，写意成简单，将过往的风景，安放在清浅的时光中。
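
One simple way to think about the emotion weight in the table above is as a blend between the emotion carried by the timbre prompt and the emotion carried by the emotion prompt. The sketch below is only a conceptual model under that assumption, not the mechanism IndexTTS2 is confirmed to use; `blend_emotion` and the toy two-dimensional vectors are hypothetical.

```python
def blend_emotion(base, emotion, weight):
    """Blend two emotion vectors of equal length.

    Assumed behavior (not confirmed by the source):
      weight = 0   -> emotion of the timbre prompt only
      weight = 1.0 -> emotion of the emotion prompt only
      weight > 1.0 -> extrapolates past the emotion prompt (exaggeration)
    """
    if len(base) != len(emotion):
        raise ValueError("vectors must have the same length")
    if weight < 0:
        raise ValueError("weight must be non-negative")
    return [b + weight * (e - b) for b, e in zip(base, emotion)]


if __name__ == "__main__":
    timbre_emotion = [0.0, 1.0]  # toy vector for the timbre prompt
    target_emotion = [1.0, 0.0]  # toy vector for the emotion prompt
    for w in (0, 0.6, 1.0, 1.4):  # weights shown in the demo table
        print(w, blend_emotion(timbre_emotion, target_emotion, w))
```

Under this reading, a weight of 0 ignores the emotion prompt entirely, while 1.4 pushes the result beyond the emotion prompt rather than stopping at it.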

2.3 Using Textual Description

In this alternative approach, we replace the emotional audio reference with a descriptive textual prompt while keeping the timbre reference audio unchanged. The model therefore conditions emotional expression solely on text-based emotion cues rather than on audio signals. This setup shows how well the model can generate emotionally expressive speech from linguistic descriptions alone, without relying on emotion audio prompts.

Timbre Audio Prompt	Emotion Description	Text	Audio
	I feel really down	这究竟是我的福，还是我的孽？岂止是皇上错了，我更是错了！这几年的情爱与时光，究竟是错付了！
	有点快乐，哈哈
	巨巨巨巨巨巨巨巨难过
	I feel really down	Was this my blessing, or my curse? It’s not just the Emperor who was wrong — I was even more mistaken! All these years of love and devotion… in the end, were they nothing but a wasted heart?
	有点快乐，哈哈
	巨巨巨巨巨巨巨巨难过
	书桓走的第一天，想他，想他，想他。	[Chinese text not rendered in source]
	书桓走的第一天，想他，想他，想他。	On the first day that Shuhan left, all I did was miss him. Miss him. Miss him.
	书桓走的第二天，想他，想他，想他。	[Chinese text not rendered in source]
	书桓走的第二天，想他，想他，想他。	The second day Shuhan is gone, and still — I miss him. Miss him. Miss him.
	书桓走了第三天了，想他，想他，想他，发疯一样的想他。	[Chinese text not rendered in source]
	书桓走了第三天了，想他，想他，想他，发疯一样的想他。	The third day Shuhan has been gone… and I still miss him. Miss him. Miss him. I miss him like I've lost my mind.
	超级无敌爆炸angry的情感，就像刚中了彩票被人偷拿了	你问他为什么我没谈恋爱，我就失恋了，你问他，为什么这么对我，我以为我会问，可是我见到他之后，我就不想问了，因为人家根本就不想说，人家甚至都不想见到你，我为什么在那儿犯贱呢，所以我不是放过他，我是想放过我自己，人家不联系你怎么了，不回你微信怎么了，伤害你又怎么了，你算他谁啊?
	又生气又委屈	我为什么非得知道发生了什么呢，我不就是想给自己找一个原因嘛，我只是想找一个原因，我原谅他，可是我为什么非得原谅他呢，我干嘛把自己搞得这么卑微啊?
	我们正在做一些神奇的事情，给我来一种又fear，但是又有点开心的情感。	这游戏太刺激了，心跳都快停了...但我们又能感受到那种挑战未知的兴奋和快乐。说真的，我现在是又害怕又期待，紧张得手心都在冒汗！

3. Zero-Shot TTS Comparison With Other Open-Source Models

3.1 Monolingual Speech Synthesis

Audio Prompt	Text	Ground Truth	Model	Audio (duration 0.75x)	Audio (duration 1.0x)	Audio (duration 1.25x)
	家居养娃的李娜又重新出现在媒体大众的面前		IndexTTS2
	These are two of only three known formations to have dinosaur fossils in Antarctica.		IndexTTS2
	The man looked at him without responding.		IndexTTS2
	胡萝卜凉拌或炒鸡蛋味道都是棒极的，胡萝卜骄傲地说。		IndexTTS2
	rodolfo arrived at his own house without any impediment and leocadia's parents reached theirs heart broken and despairing		IndexTTS2
	那些袖珍衣服挂在架子上，远远看上去就像一幅画，可漂亮了。		IndexTTS2