Gladia believes real-time processing is the next frontier of audio transcription APIs

French startup Gladia, which offers a speech-recognition application programming interface (API), has raised $16 million in a Series A funding round. Essentially, Gladia's API lets you turn any audio file into text with a high level of accuracy and low turnaround time.

While Amazon, Microsoft, and Google all offer speech-to-text APIs as part of their cloud-hosting product suites, they don’t perform as well as newer models offered by specialized startups.

There has been tremendous progress in this field over the past couple of years, especially after the release of Whisper by OpenAI. Gladia competes with other well-funded companies in the space, such as AssemblyAI, Deepgram and Speechmatics.

Gladia originally offered a fine-tuned version of Whisper’s speech-to-text model with some much-needed improvements. For instance, the startup supports diarization out of the box: it can detect when there are multiple speakers in a conversation and split the recording, and the transcribed text, according to who’s talking.
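To make the idea of diarization concrete, here is a minimal sketch of how an application might consume diarized output. It assumes the transcription service returns a list of segments tagged with a speaker label; the `Segment` type and its fields are hypothetical and are not taken from Gladia's API.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Segment:
    # Hypothetical shape of one diarized segment; not Gladia's actual schema.
    speaker: int   # speaker label assigned by the diarization step
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcribed text for this segment


def group_by_speaker(segments):
    """Merge consecutive segments from the same speaker into one turn."""
    turns = []
    for speaker, run in groupby(segments, key=lambda s: s.speaker):
        run = list(run)
        merged_text = " ".join(s.text for s in run)
        turns.append(Segment(speaker, run[0].start, run[-1].end, merged_text))
    return turns
```

Grouping consecutive segments this way turns a flat list of utterances into readable speaker turns, which is the form a meeting recorder would typically display.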

Gladia supports 100 languages and a wide variety of accents. This reporter can confirm that it works: I’ve been using Gladia to transcribe some interviews, and accents weren’t an issue.

The startup offers its speech-to-text model as a hosted API that users can leverage in their own applications and services. More than 600 companies use Gladia, including several meeting recorders and note-taking assistants like Attention, Circleback, Method Financial, Recall, Sana and Veed.io.

That particular use case is interesting, because many companies have to chain API calls. They first turn speech into text, which they then feed into a large language model (LLM), such as GPT-4o or Claude 3.5 Sonnet, to extract knowledge from large walls of text.
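The chained pipeline described here can be sketched roughly as follows. This is an illustration under stated assumptions, not any vendor's real client code: `speech_to_text` and `llm` are hypothetical callables standing in for whichever transcription API and LLM API a company happens to use.

```python
from typing import Callable


def transcribe_then_summarize(
    audio: bytes,
    speech_to_text: Callable[[bytes], str],  # hypothetical: wraps a transcription API call
    llm: Callable[[str], str],               # hypothetical: wraps an LLM API call
) -> str:
    """Chain two separate services: audio -> transcript -> LLM summary."""
    # Step 1: first API call turns the audio into plain text.
    transcript = speech_to_text(audio)
    # Step 2: second API call sends the transcript to an LLM with an instruction.
    prompt = "Summarize the key points of this conversation:\n\n" + transcript
    return llm(prompt)
```

Each step is a separate network round trip with its own vendor, billing and failure modes, which is the overhead a single combined API call would remove.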

With the new funding, Gladia wants to simplify that pipeline by integrating audio intelligence and LLM-based tasks in a single API call. For instance, a customer could get a conversation summary generated from a handful of bullet points without having to rely on a third-party LLM API.

The other issue that Gladia is looking to solve is latency. You may have seen some demos of real-time audio conversations with an AI-based calling agent (11x has a good demo on its website), and these systems have to be able to transcribe in near real time to make such conversations sound as human-like as possible.

“We realized that real time wasn't very good in terms of quality in the market in general. And people had a weird use case. They were doing real-time processing, and then they were grabbing the audio and running it in batch. We wondered: ‘Why are you doing this?’ They told us: ‘The quality isn't good in real-time processing, so we transcribe it in batch afterwards,’” co-founder and CEO Jean-Louis Quéguiner (pictured above; right) told TechCrunch.