Sesame has unveiled its latest research on conversational voice technology, focusing on achieving "voice presence"—a quality that makes interactions with digital assistants feel authentic and emotionally resonant. The announcement, made on February 27, 2025, highlights the limitations of current voice assistants, which often fail to engage users due to their emotionally neutral tone. Sesame aims to address this gap by developing AI companions capable of understanding and responding to emotional cues, conversational dynamics, and contextual nuances.
> At Sesame, we believe in a future where computers are lifelike. Today we are unveiling an early glimpse of our expressive voice technology, highlighting our focus on lifelike interactions and our vision for all-day wearable voice companions.
>
> — Sesame (@sesame), February 27, 2025
The centerpiece of this initiative is the Conversational Speech Model (CSM), a new approach to speech generation built on multimodal transformer learning. Unlike traditional text-to-speech (TTS) systems, CSM integrates text and audio context to produce speech that adapts to conversational history, tone, and rhythm. The model operates as a single-stage system, which improves both efficiency and expressivity, and it leverages semantic and acoustic tokens for high-fidelity audio reconstruction. Key advancements include low-latency generation and a compute-efficient training process.
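To make the single-stage idea concrete, the sketch below shows a toy transformer that attends over interleaved text and audio tokens and predicts the next audio token directly, with no separate acoustic-refinement stage. The class name, vocabulary sizes, and layer dimensions are illustrative assumptions, not Sesame's published architecture.

```python
# Minimal sketch of single-stage, context-conditioned speech generation:
# one transformer attends over text (conversation history) and prior audio
# tokens, then predicts the next audio token directly. All sizes and names
# here are illustrative assumptions.
import torch
import torch.nn as nn

class ConversationalSpeechSketch(nn.Module):
    def __init__(self, text_vocab=32000, audio_vocab=2048,
                 d_model=256, n_heads=4, n_layers=4, max_len=4096):
        super().__init__()
        # Separate embeddings for text tokens and audio (semantic/acoustic) tokens.
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.audio_emb = nn.Embedding(audio_vocab, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # A single head predicts the next audio token, keeping generation one-stage.
        self.audio_head = nn.Linear(d_model, audio_vocab)

    def forward(self, text_ids, audio_ids):
        # Concatenate both modalities along the sequence axis so the model
        # conditions on conversational history as well as prior audio.
        x = torch.cat([self.text_emb(text_ids), self.audio_emb(audio_ids)], dim=1)
        positions = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_emb(positions)
        # Causal mask keeps decoding autoregressive, which suits low-latency streaming.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)
        # Return logits only for the audio positions.
        return self.audio_head(h[:, -audio_ids.size(1):])

model = ConversationalSpeechSketch()
text = torch.randint(0, 32000, (1, 16))    # tokenized conversation history
audio = torch.randint(0, 2048, (1, 64))    # prior audio tokens
print(model(text, audio).shape)            # torch.Size([1, 64, 2048])
```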
Sesame's research emphasizes several core components for achieving natural voice interactions:
- Emotional intelligence
- Contextual awareness
- Conversational timing
- Consistent personality
The team has demonstrated progress in areas like pronunciation correction, contextual expressivity, and handling multi-speaker conversations. However, challenges remain in fully replicating human-like prosody and conversational flow.
To evaluate CSM's performance, Sesame introduced new benchmarks such as Homograph Disambiguation and Pronunciation Continuation Consistency, which measure the model's ability to adapt pronunciation based on conversational context. Subjective evaluations using the Expresso dataset revealed that while CSM-generated speech approaches human quality in isolated samples, it still lags human recordings in conversational appropriateness.
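As an illustration of how such a benchmark might be scored, here is a self-contained sketch of a homograph-disambiguation accuracy loop. The test cases and the `pronounce` callback are hypothetical stand-ins; a real evaluation would synthesize audio and recover the rendered pronunciation, for example via forced alignment or a phoneme recognizer.

```python
# Hypothetical scoring loop for a homograph-disambiguation benchmark.
# Test items and the `pronounce` callback are illustrative stand-ins,
# not Sesame's published evaluation harness.
from dataclasses import dataclass
from typing import Callable

@dataclass
class HomographCase:
    sentence: str    # context that disambiguates the homograph
    homograph: str   # the ambiguous written form
    expected: str    # contextually correct pronunciation, e.g. ARPAbet

CASES = [
    HomographCase("She will lead the team tomorrow.", "lead", "L IY D"),
    HomographCase("The pipe was made of lead.", "lead", "L EH D"),
    HomographCase("Please read the report tonight.", "read", "R IY D"),
    HomographCase("I read that book last year.", "read", "R EH D"),
]

def homograph_accuracy(pronounce: Callable[[str, str], str]) -> float:
    """Fraction of cases where the rendered pronunciation of the
    homograph matches the contextually correct variant."""
    correct = sum(
        pronounce(case.sentence, case.homograph) == case.expected
        for case in CASES
    )
    return correct / len(CASES)

if __name__ == "__main__":
    # Dummy oracle so the sketch runs end-to-end; swap in a real
    # TTS + phoneme-recognition pipeline to score an actual model.
    oracle = {case.sentence: case.expected for case in CASES}
    print(homograph_accuracy(lambda sent, word: oracle[sent]))  # 1.0
```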
The company plans to expand the model's capabilities by scaling up its training dataset and adding support for over 20 languages. Future efforts will focus on duplex models that can seamlessly manage turn-taking and pacing in conversation. As part of its commitment to open collaboration, Sesame intends to open-source key components of its research under the Apache 2.0 license.
This development positions Sesame as a key player in advancing conversational AI, addressing both technical challenges and user expectations for more human-like digital interactions.