SoulX-Podcast is an innovative speech synthesis model launched by Soul AI Lab, specifically designed for podcast scenarios. It can generate long-form, natural, emotionally rich multi-speaker dialogue speech, authentically reproducing the intonation, pauses, and dialectal features of human conversation.
SoulX-Podcast can generate over 90 minutes of high-quality speech content, supporting multi-speaker podcasts, virtual interviews, dialogue novels, and other scenarios.
Supports Mandarin, English, and multiple Chinese dialects (Sichuanese, Henan dialect, Cantonese, etc.). Zero-shot voice cloning works without any additional training, giving each character a unique voice and accent.
Built-in rich paralinguistic controls naturally add laughter, sighs, tone changes, and other details to speech, making synthesized output more expressive and human-like.
The latest model is now available on Hugging Face with improved performance and capabilities.
Project paper officially published on arXiv. View Paper →
Quick start:
git clone https://github.com/Soul-AILab/SoulX-Podcast.git
huggingface-cli download Soul-AILab/SoulX-Podcast-1.7B
bash example/infer_dialogue.sh

Use cases:
Automated generation of podcasts and interviews
Create virtual characters and AI broadcasters
Research and education in multi-dialect speech
Create audio novels and radio dramas
SoulX-Podcast adopts the Apache 2.0 open source license and can be freely used for research and educational projects.
Please follow ethical guidelines and avoid using it for any unauthorized voice cloning or fraudulent activities.