Introducing Amazon Nova Sonic: A New Gen AI Model for Building Voice Applications and Agents

Business Wire

Tue, Apr 8, 2025, 7:00 AM

SEATTLE, April 08, 2025--(BUSINESS WIRE)--Today, Amazon.com, Inc. (NASDAQ: AMZN) introduced Amazon Nova Sonic, a new foundation model that unifies speech understanding and speech generation in a single model to enable more human-like voice conversations in artificial intelligence (AI) applications. Available in Amazon Bedrock via a new bidirectional streaming API, the model simplifies the development of voice applications, such as customer service call automation and AI agents, across a broad range of industries, including travel, education, healthcare, and entertainment.

"From the invention of the world’s best personal AI assistant with Alexa, to developing AWS services like Connect, Lex, and Polly that are used across a wide range of industries, Amazon has long believed that voice-powered applications can make all of our customers’ lives better and easier," said Rohit Prasad, SVP of Amazon Artificial General Intelligence. "With Amazon Nova Sonic, we are releasing a new foundation model in Amazon Bedrock that makes it simpler for developers to build voice-powered applications that can complete tasks for customers with higher accuracy, while being more natural, and engaging."

Traditional approaches to building voice-enabled applications involve complex orchestration of multiple models, such as speech recognition to convert speech to text, large language models (LLMs) to understand and generate responses, and text-to-speech to convert text back to audio. This fragmented approach not only increases development complexity but also fails to preserve crucial acoustic context and nuances like tone, prosody, and speaking style that are essential for natural conversations.

Nova Sonic solves these challenges through a unified model architecture that delivers both speech understanding and speech generation, without requiring a separate model for each step. This unification enables the model to adapt the generated voice response to the acoustic context (e.g., tone, style) and the spoken input, resulting in more natural dialog. Nova Sonic also understands the nuances of human conversation, including the speaker’s natural pauses and hesitations, waiting to speak until the appropriate time and gracefully handling barge-ins. It also generates a text transcript of the user’s speech, which developers can use to call specific tools and APIs when building voice-enabled AI agents (e.g., an AI-powered travel agent that can book flights by retrieving up-to-date flight information). These capabilities, along with its lightning-fast inference, make voice applications powered by Nova Sonic more natural and useful.

State-of-the-art accuracy and quality

Nova Sonic has been rigorously tested against a wide range of industry-standard benchmarks for speech understanding and generation, demonstrating exceptional quality and accuracy for human-like, real-time voice conversations.

The model excels in natural dialog handling, seamlessly understanding and adapting to pauses, hesitations, and interruptions while maintaining conversational context throughout the interaction. This capability contributed to strong performance for overall quality and accuracy in turn-taking tests.

Nova Sonic demonstrates strong performance on overall conversation quality compared to other models in the industry, which at this time include a select few with similar real-time conversational speech capabilities, such as OpenAI’s GPT-4o (Realtime) and Google Gemini Flash 2.0 (available via Gemini’s experimental live API). For example, on single-turn dialogs from the Common Eval data set, its American English masculine-sounding voice achieved win rates of 51.0% and 69.7% against OpenAI’s GPT-4o (Realtime) and Google’s Gemini Flash 2.0, respectively. Likewise, Nova Sonic’s American English feminine-sounding voice achieved win rates of 50.9% and 66.3% against OpenAI’s GPT-4o (Realtime) and Google’s Gemini Flash 2.0, respectively, on the same data set. Nova Sonic’s British English feminine-sounding voice also performed strongly, with a 58.3% win rate against OpenAI’s GPT-4o (Realtime).

Because recognizing spoken words is critical to generating accurate responses, Nova Sonic’s speech recognition accuracy was also measured in terms of word error rate (WER) across a wide range of languages, dialects, and accents. On the Multilingual LibriSpeech (MLS) data set, Nova Sonic achieved a WER of 4.2% when averaged across English, French, Italian, German, and Spanish, which is 36.4% lower in relative terms than OpenAI’s GPT-4o Transcribe model.

On the English utterances of the MLS data set, its WER is 24.2% lower in relative terms than that of OpenAI’s GPT-4o Transcribe model.

Nova Sonic is also robust to noisy conditions, with a 46.7% relative lower WER for English compared to OpenAI’s GPT-4o Transcribe model, measured on the Augmented Multi-Party Interaction (AMI) meeting benchmark, which consists of real-world noisy and multi-speaker interactions.
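
For readers unfamiliar with the metric, the following is a minimal Python sketch of how WER and a "relative lower" comparison are commonly computed. It is not Amazon's evaluation code: the function names and the sample sentences are hypothetical, and it assumes that "relative lower" means (baseline WER - model WER) / baseline WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, model_wer: float) -> float:
    """Fraction by which model_wer is lower than baseline_wer (assumed definition)."""
    return (baseline_wer - model_wer) / baseline_wer


def implied_baseline(model_wer: float, reduction: float) -> float:
    """Baseline WER implied by a model WER and a relative reduction."""
    return model_wer / (1.0 - reduction)


# Hypothetical utterance: one deleted word ("a") and one inserted word ("morning").
wer = word_error_rate(
    "book a flight to seattle tomorrow",
    "book flight to seattle tomorrow morning",
)
print(f"example WER: {wer:.3f}")

# Derived from the announced figures (4.2% WER, 36.4% relative lower);
# the baseline value itself is not stated in the announcement.
print(f"implied baseline WER: {implied_baseline(0.042, 0.364):.3%}")
```

Under that assumed definition, the announced 4.2% average WER and 36.4% relative difference would imply a baseline WER of roughly 6.6%. That figure is derived here and is not reported in the announcement.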

Tool-use for function calling and agentic workflows

Nova Sonic also supports tool-use for applications—like customer service call automation—that require the responses to be factually grounded in enterprise data, such as pricing plans, available inventory, and schedule availability. Nova Sonic’s native tool-use also enables the model to resolve complex customer queries and complete tasks on behalf of customers, for example, "make a reservation" or "find alternate flights."
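
As an illustration of the tool-use pattern described above, the sketch below shows how an application might route a model's tool request to its own code and return a grounded result. It is a generic pattern, not the Amazon Bedrock API: the request shape, the `TOOLS` registry, the `find_alternate_flights` function, and the flight data are all hypothetical.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical in-memory data standing in for an enterprise flight system.
FLIGHTS: List[Dict[str, str]] = [
    {"origin": "SEA", "destination": "SFO", "departs": "09:15"},
    {"origin": "SEA", "destination": "SFO", "departs": "13:40"},
    {"origin": "SEA", "destination": "JFK", "departs": "07:05"},
]


def find_alternate_flights(origin: str, destination: str) -> List[Dict[str, str]]:
    """Return flights matching the requested route (hypothetical tool)."""
    return [
        f for f in FLIGHTS
        if f["origin"] == origin and f["destination"] == destination
    ]


# Registry mapping tool names the model may request to local functions.
TOOLS: Dict[str, Callable[..., Any]] = {
    "find_alternate_flights": find_alternate_flights,
}


def dispatch_tool_call(request: Dict[str, Any]) -> str:
    """Run the requested tool and return its result as a JSON string."""
    name = request.get("name", "")
    handler = TOOLS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    result = handler(**request.get("arguments", {}))
    return json.dumps({"tool": name, "result": result})


# Assumed shape of a tool request produced after the user says
# "find alternate flights from Seattle to San Francisco".
response = dispatch_tool_call({
    "name": "find_alternate_flights",
    "arguments": {"origin": "SEA", "destination": "SFO"},
})
print(response)
```

In a real application, the JSON result would be passed back to the model so that its spoken reply reflects the retrieved data rather than the model's own guess.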

Multiple native voices and speaking styles

Nova Sonic supports three expressive voices, including both masculine-sounding and feminine-sounding voices, now generally available in English. It supports speech generation in different English accents, including American and British. Support for additional languages and accents is coming soon.

Industry-leading speed and price performance

Nova Sonic delivers an average customer-perceived latency of 1.09 seconds, measured from the time the customer finishes talking to the time the system generates the first speech response. This compares with 1.18 seconds for OpenAI’s GPT-4o (Realtime) and 1.41 seconds for Google’s Gemini Flash 2.0 (available via Gemini’s experimental live API), per benchmarking by Artificial Analysis.

Nova Sonic is the most cost-efficient model in the industry among models that offer similar real-time speech conversation functionality and have public pricing available. For example, it is nearly 80% less expensive than OpenAI’s GPT-4o (Realtime).
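
To put these latency figures in relative terms, the short Python sketch below computes how much lower Nova Sonic's reported latency is than each comparison figure. The latency values come from the announcement; the function and variable names are illustrative, and the percentages are derived here rather than reported by Amazon or Artificial Analysis.

```python
def relative_reduction(baseline_s: float, candidate_s: float) -> float:
    """Fraction by which candidate_s is lower than baseline_s."""
    if baseline_s <= 0:
        raise ValueError("baseline latency must be positive")
    return (baseline_s - candidate_s) / baseline_s


# Average customer-perceived latency in seconds, as cited in the announcement.
NOVA_SONIC_S = 1.09
COMPARISONS_S = {
    "GPT-4o (Realtime)": 1.18,
    "Gemini Flash 2.0": 1.41,
}

for model, latency_s in COMPARISONS_S.items():
    pct = 100 * relative_reduction(latency_s, NOVA_SONIC_S)
    print(f"Nova Sonic vs {model}: {pct:.1f}% lower latency")
```

By this calculation, the reported 1.09 seconds is about 7.6% lower than 1.18 seconds and about 22.7% lower than 1.41 seconds.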

Amazon Nova Sonic is helping companies drive better customer satisfaction and productivity

ASAPP empowers enterprise customers’ contact centers to deliver unmatched customer service through GenerativeAgent, a fully conversational generative AI voice agent. "At ASAPP, we are focused on using generative AI to deliver reliable, secure, and high-performing solutions for improving customer service in contact centers. We’ve been particularly impressed by Amazon Nova Sonic’s highly accurate speech understanding capabilities which allow for more natural voice interactions and precise dialog handling over telephony," said Nirmal Mukhi, VP of AI Engineering at ASAPP. "We’re excited to continue using Nova Sonic to deliver secure, high-quality, and precise conversations that meet the demands of enterprise contact centers."

Education First (EF) is a leader in international education through its networks of schools and offices in over 50 countries. "Amazon Nova Sonic enables EF students to practice new vocabulary and refine their pronunciation in a dynamic learning environment, while the interactive nature of the model allows students to receive immediate feedback on their pronunciation attempts, contributing to a more efficient and effective learning process. The model is capable of accurately understanding non-native English speakers with a variety of accents. We were also impressed with the barge-in feature of Nova Sonic, where the model quickly reacts to interruptions," said Tim Hesse, VP of AI and Data at EF. "The scalability and reliability of the technology will allow us to expand our capacity to serve a larger student population simultaneously, without compromising the quality of instruction."

Stats Perform is a sports data and AI technology provider, serving global media organizations, betting operators, and professional sports teams. "At Stats Perform, our goal is to empower the world’s top sports broadcasters, media, federations and teams with magic in the detail of our vast live and historical Opta sports dataset, to help them win audiences, customers and trophies. With the Opta AI Chat they can generate unique, accurate, and contextual responses, driven by live data insights with remarkable speed, in multiple formats and languages, to find a winning analytical or storytelling edge," said Mike Perez, Chief Operating Officer at Stats Perform. "We’ve been testing Amazon Nova Sonic and have been particularly impressed by the system's low latency, which enables near-instantaneous responses even to complex queries of our model, creating a seamless user experience that turns human experts into superhuman experts. The intuitive prompting capability and ease of setup have exceeded our expectations, making implementation simple. Overall, Nova Sonic has proven to be a fantastic solution."

Amazon is committed to the responsible development of artificial intelligence

Amazon Nova models are built with integrated safety measures and protections. The company has launched AWS AI Service Cards for Nova models, offering transparent information on use cases, limitations, and responsible AI practices.

To get started with Amazon Nova models, visit: https://aws.amazon.com/nova/

To learn more about today’s announcement, visit About Amazon.

View source version on businesswire.com: https://www.businesswire.com/news/home/20250408227167/en/

Contacts

Amazon.com, Inc.
Media Hotline
Amazon-pr@amazon.com
www.amazon.com/pr
