Zyphra has announced the beta release of Zonos-v0.1, a suite of open-source text-to-speech (TTS) models featuring high-fidelity voice cloning and real-time capabilities. The release includes two 1.6-billion-parameter models: a transformer-based model and a hybrid model that combines transformer layers with state-space model (SSM) layers. Both are available under the permissive Apache 2.0 license, making them accessible to researchers and developers via platforms like Hugging Face and GitHub.
The Zonos models were trained on approximately 200,000 hours of speech data, covering both neutral-toned speech, such as audiobook narration, and highly expressive speech. While the majority of the dataset is in English, it also includes significant portions of Chinese, Japanese, French, Spanish, and German. The models perform less well on languages that are less represented in the training data.
Zyphra (@ZyphraAI) announced the release in a post on February 10, 2025:
"Today, we're excited to announce a beta release of Zonos, a highly expressive TTS model with high fidelity voice cloning. We release both transformer and SSM-hybrid models under an Apache 2.0 license. Zonos performs well vs leading TTS providers in quality and expressiveness."
Zonos supports high-fidelity voice cloning from audio samples of 5 to 30 seconds, along with conditioning inputs for emotion (e.g., sadness, anger), speaking rate, pitch, and audio quality. The models generate speech at a native 44 kHz sample rate. The hybrid model is optimized for lower latency and reduced memory usage compared to its transformer counterpart, thanks to its Mamba2-based architecture.
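As a rough illustration of the 5 to 30 second guideline for voice-cloning samples, the following sketch checks whether a reference clip falls inside that window before it is passed to a cloning pipeline. The function names and constants are hypothetical and are not part of the Zonos codebase; the bounds simply mirror the range quoted above, and the library itself may not enforce them.

```python
# Hypothetical helper: not part of Zonos. The bounds mirror the
# 5 to 30 second range quoted in the article.
MIN_REFERENCE_SECONDS = 5.0
MAX_REFERENCE_SECONDS = 30.0


def clip_duration_seconds(num_samples: int, sample_rate_hz: int) -> float:
    """Return the duration of a mono clip given its sample count and rate."""
    if sample_rate_hz <= 0:
        raise ValueError("sample_rate_hz must be positive")
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    return num_samples / sample_rate_hz


def is_usable_reference(num_samples: int, sample_rate_hz: int) -> bool:
    """True if the clip length falls inside the assumed 5-30 s window."""
    duration = clip_duration_seconds(num_samples, sample_rate_hz)
    return MIN_REFERENCE_SECONDS <= duration <= MAX_REFERENCE_SECONDS


if __name__ == "__main__":
    # A 10-second clip sampled at 44.1 kHz.
    print(is_usable_reference(44_100 * 10, 44_100))
```

A check like this is cheap to run before loading a model, which can matter when inference itself is comparatively expensive.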
Despite its strengths, Zonos has limitations, including occasional audio artifacts and problems aligning the generated speech with the input text. The high-bitrate autoencoder used in the models delivers high audio quality but increases computational cost during inference. On high-end GPUs like the NVIDIA RTX 4090, Zonos achieves a latency of 200-300 ms and a real-time factor above 1, meaning it generates audio faster than the audio plays back.
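To make the real-time factor figure concrete, here is a minimal sketch of the arithmetic. It assumes the convention implied above, in which the factor is seconds of audio produced per second of compute, so values above 1 are faster than real time. Some other sources define the ratio the other way around (compute time divided by audio duration), so the direction should always be checked. The function names and the sample numbers are illustrative and are not measurements from Zyphra.

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of compute (higher is faster).

    Assumed convention: a value above 1 means faster than real time.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


def keeps_up_with_playback(rtf: float) -> bool:
    """True if audio is generated at least as fast as it is consumed."""
    return rtf >= 1.0


if __name__ == "__main__":
    # Illustrative numbers only: 10 s of audio generated in 5 s of compute.
    rtf = real_time_factor(audio_seconds=10.0, generation_seconds=5.0)
    print(rtf, keeps_up_with_playback(rtf))
```

Latency is a separate quantity: it describes how long a listener waits before audio starts, whereas the real-time factor describes whether generation can then stay ahead of playback.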
Zyphra aims to address these challenges in future updates by improving language support, pronunciation accuracy, emotional control, and inference efficiency. The company has positioned Zonos as a competitor to proprietary TTS solutions like ElevenLabs while advancing open-source research in audio generation.