Your phone rings. A voice asks how it can help, listens, answers a follow-up question, and books a meeting. No human is on the line. Behind that one call, three separate AI models are running in a loop, racing a clock measured in milliseconds.
That loop is how AI voice agents work: speech-to-text turns the caller’s words into text, a large language model reads that text and decides what to say, and text-to-speech speaks the reply out loud. Topcalls runs that full round trip in under 500 milliseconds, which is the only reason the call feels like a conversation instead of a switchboard. Here’s every stage, why latency decides whether it works, and where the technology still breaks.
Key Takeaways
- Three models, one loop. Speech-to-text, a large language model, and text-to-speech run in sequence on every turn of the conversation.
- Sub-500ms is the bar. Topcalls targets under 500 milliseconds of response latency because human conversation leaves only about a 200-millisecond gap between turns.
- Whisper-class speech-to-text hits a 2.5% word error rate on clean audio, accurate enough to drive a real-time reply with almost no misheard words.
- Barge-in separates natural from robotic. A good agent stops talking the instant the caller speaks, the way OpenAI’s Realtime API handles interruptions.
- Speech-to-speech is collapsing the pipeline. Newer models fold all three stages into one, cutting the handoffs that used to add hundreds of milliseconds.
- Topcalls runs 63,000+ AI calls a day across 29+ languages at $0.35 per minute, all-inclusive.
1. How Do AI Voice Agents Work?
An AI voice agent works by running three models in a fast loop on every turn of the call. Speech-to-text transcribes what the caller says, a large language model reads the transcript plus the conversation history and writes a reply, and text-to-speech voices it. Topcalls completes the whole cycle in under 500 milliseconds, so the pause feels human instead of mechanical.
The agent doesn’t do this once. It repeats the loop for every back-and-forth in the call. The caller speaks, the loop runs, the agent answers, the caller responds, and the loop fires again. A two-minute call might cycle through it twenty or thirty times.
Sitting on top of those three models is the orchestration layer. It tracks whose turn it is, decides when the caller has finished a sentence, holds the goal of the call, and pulls in live data like a calendar slot or a CRM record mid-conversation. The three models are the engine. The orchestration is the driver.
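To make the loop concrete, here is a minimal sketch of one turn passing through the three stages while an orchestration-style state object keeps the history. Every name in it (`CallState`, `transcribe`, `generate_reply`, `synthesize`, `run_turn`) is hypothetical, and the three "models" are trivial stand-ins rather than real speech or language models.

```python
from dataclasses import dataclass, field


@dataclass
class CallState:
    """Conversation state kept for one call (hypothetical structure)."""
    goal: str
    history: list[tuple[str, str]] = field(default_factory=list)  # (speaker, text)


def transcribe(audio: bytes) -> str:
    # Stand-in for speech-to-text: the "audio" here is just UTF-8 text.
    return audio.decode("utf-8")


def generate_reply(state: CallState, caller_text: str) -> str:
    # Stand-in for the LLM: a real model would read the goal and full history.
    turn_number = len(state.history) // 2 + 1
    return f"(turn {turn_number}) You said: {caller_text}"


def synthesize(text: str) -> bytes:
    # Stand-in for text-to-speech: returns the "audio" as bytes.
    return text.encode("utf-8")


def run_turn(state: CallState, caller_audio: bytes) -> bytes:
    """One pass through the loop: STT, then LLM, then TTS, updating the history."""
    caller_text = transcribe(caller_audio)
    reply_text = generate_reply(state, caller_text)
    state.history.append(("caller", caller_text))
    state.history.append(("agent", reply_text))
    return synthesize(reply_text)


state = CallState(goal="book a meeting")
for utterance in [b"Hi, who is this?", b"Sure, Tuesday works."]:
    audio_out = run_turn(state, utterance)
    print(audio_out.decode("utf-8"))
```

The point of the sketch is the shape: the same `run_turn` function fires on every back-and-forth, and only the state object carries anything from one turn to the next.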
Topcalls wires this whole stack into a sales workflow, so the same loop that holds the conversation also updates your records and books the meeting. That’s the job of our AI voice agents, built for outbound calling at volume.
2. What Is the STT-LLM-TTS Pipeline?
The STT-LLM-TTS pipeline is the three-stage chain that turns a caller’s voice into the agent’s spoken reply. STT (speech-to-text) transcribes the audio into words, the LLM (large language model) reads those words and generates a response, and TTS (text-to-speech) converts that response back into natural speech. Each stage adds a slice of delay, and the sum is what the caller hears as the agent’s response time.
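The "sum of slices" idea can be written as simple arithmetic. The per-stage numbers below are invented for illustration and are not measurements from Topcalls or any vendor. Only the 500-millisecond budget comes from the article.

```python
BUDGET_MS = 500  # end-to-end target named in the article

# Hypothetical per-stage delays in milliseconds (illustrative, not measured).
stage_latency_ms = {
    "speech_to_text": 120,
    "llm_first_token": 200,
    "text_to_speech_first_audio": 100,
    "network_and_orchestration": 50,
}


def response_time_ms(stages: dict[str, int]) -> int:
    """The caller-perceived delay is the sum of every stage's delay."""
    return sum(stages.values())


total = response_time_ms(stage_latency_ms)
headroom = BUDGET_MS - total
print(f"total={total} ms, headroom={headroom} ms, within budget={total <= BUDGET_MS}")
```

With these assumed numbers the loop lands at 470 ms, leaving 30 ms of headroom, which shows how little slack any single stage has.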
Speech-to-text: turning sound into words
Speech-to-text, also called automatic speech recognition, listens to the raw audio and writes out the words in near real time. Modern systems are sharp. OpenAI’s Whisper model hits a 2.5% word error rate on clean speech, reported in OpenAI’s own Whisper research paper (Radford et al., 2022). That means the model gets roughly 39 of every 40 words right before the language model ever sees the transcript. Noise, accents, and crosstalk push that number up, which is why call audio quality matters so much.
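For readers who want to see what a word error rate measures, the sketch below computes it the standard way: word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The function name and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


ref_words = [f"word{i}" for i in range(40)]
hyp_words = ref_words.copy()
hyp_words[7] = "heard"  # one misheard word out of 40

print(word_error_rate(" ".join(ref_words), " ".join(hyp_words)))  # 1 / 40 = 0.025
```

One wrong word in a 40-word transcript gives 0.025, the same 2.5% figure quoted above.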

Speed counts as much as accuracy here. The speech-to-text stage streams partial transcripts as the caller talks instead of waiting for a full sentence, so the language model can start thinking before the caller has even finished.
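A rough sketch of what streaming partial transcripts buys you, using a plain Python generator in place of a real speech-to-text stream. `Partial`, `stream_partials`, and `partials_until_keyword` are hypothetical names, and real systems emit partials on audio timing rather than word by word.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Partial(NamedTuple):
    text: str
    is_final: bool


def stream_partials(words: Iterable[str]) -> Iterator[Partial]:
    """Yield a growing transcript after each word, then one final transcript."""
    heard: list[str] = []
    for word in words:
        heard.append(word)
        yield Partial(" ".join(heard), is_final=False)
    yield Partial(" ".join(heard), is_final=True)


def partials_until_keyword(partials: Iterable[Partial], keyword: str) -> Optional[int]:
    """Return how many partials arrived before the keyword first appeared."""
    for count, partial in enumerate(partials, start=1):
        if keyword in partial.text.split():
            return count
    return None


words = "can we move the meeting to friday".split()
print(partials_until_keyword(stream_partials(words), "meeting"))  # 5
```

Here a downstream consumer sees "meeting" on the fifth partial, two words before the caller finishes, which is the head start the language model gets from streaming.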
The LLM: deciding what to say
The large language model is the brain of the agent. It reads the running transcript, the system prompt that defines the agent’s job and personality, and any live data the orchestration layer fed it. Then it generates the reply, token by token, as text. This is the same class of model behind ChatGPT, tuned and prompted for a calling task instead of open-ended chat.
The LLM is where the agent’s intelligence lives: reading intent, scoring a lead, handling an objection, deciding to transfer to a human. Topcalls uses this stage to run frameworks like BANT lead qualification on every call, so the model isn’t just chatting, it’s working a sales process.
Text-to-speech: speaking the reply
Text-to-speech takes the language model’s words and turns them into audio. Neural text-to-speech doesn’t read text flat. It adds pacing, intonation, emphasis, and small human cues like breaths, so the voice carries the rhythm of real speech. The audio streams out chunk by chunk as the words are generated, so the caller starts hearing the reply before the full sentence is synthesized.
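One common way to stream speech before the full reply exists is to cut the language model's token stream into sentence-sized chunks and hand each chunk to text-to-speech as soon as it closes. The sketch below assumes that strategy. The article does not say which chunking rule Topcalls uses, and `sentence_chunks` is a hypothetical helper.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "?", "!")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into sentence-sized chunks for TTS."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


tokens = ["Sure", ",", " Tuesday", " works", ".", " Does", " 2", " pm", " suit", " you", "?"]
for chunk in sentence_chunks(tokens):
    print(chunk)
```

The first sentence is ready for synthesis while the model is still generating the second. This simple rule would split early on text like "2.5", so a production chunker needs smarter boundary detection.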
Topcalls runs this across 29+ languages with native-sounding accents, so a Spanish lead hears a Spanish-native voice and a German lead hears a German one, all from the same agent.
3. Why Does Latency Matter on a Call?
Latency matters because human conversation runs on a tight clock. The gap between one person finishing and the next starting averages around 200 milliseconds. When an AI agent takes a full second or two to respond, the caller feels the lag, assumes the line dropped, and starts talking. The whole rhythm collapses. Sub-500ms response time keeps the agent inside the window people read as normal.
That 200-millisecond figure isn’t a guess. A study of ten languages across five continents (Stivers et al., PNAS, 2009) found that the gap between conversational turns clusters tightly around 0 to 300 milliseconds. It’s one of the most consistent patterns in human language, which is exactly why a laggy AI voice feels so wrong.
Telecom set its own bar decades ago. The ITU-T G.114 recommendation advises keeping one-way transmission delay under 150 milliseconds for most voice applications and treats 400 milliseconds as the upper limit for acceptable quality. That transport delay comes on top of the agent’s own processing, so the three-model loop has very little slack to spend, which is why every stage is built to stream rather than wait.
Topcalls targets sub-500ms response latency end to end. That number is the product. Push it past a second and the connect-rate and conversion gains vanish, because callers hang up on a voice that can’t keep pace.

Want to see what faster, always-on calling does to your pipeline numbers? Run the ROI calculator and plug in your own volume.
4. How Does the Agent Handle Interruptions?
A good AI voice agent handles interruptions with barge-in detection. The instant the caller starts speaking, the agent stops its own audio, drops the rest of the planned reply, and listens. Without barge-in, the agent talks over people and reads out a scripted paragraph while the caller is trying to say "wrong number." With it, the agent yields the floor like a person would.
This is now built into the underlying models. OpenAI’s Realtime API streams audio in both directions at once and exposes a cancel mechanism, so when the caller barges in the agent can drop its current response mid-sentence and switch to listening. The detection runs on voice activity, not on waiting for a long pause, so it triggers in real time.
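The control flow of barge-in can be sketched locally with `asyncio`: playback runs as a task and is cancelled the moment a voice-activity signal fires. This is a simulation under stated assumptions, not the Realtime API's own interface. `speak`, `caller_speaks_after`, and `speak_with_barge_in` are hypothetical names, and sleeps stand in for audio playback and speech detection.

```python
import asyncio


async def speak(chunks: list[str], played: list[str], chunk_seconds: float = 0.1) -> None:
    """Play the reply chunk by chunk (simulated with sleeps)."""
    for chunk in chunks:
        await asyncio.sleep(chunk_seconds)  # pretend to play one chunk
        played.append(chunk)


async def caller_speaks_after(delay_seconds: float) -> None:
    """Stand-in for voice activity detection firing after a delay."""
    await asyncio.sleep(delay_seconds)


async def speak_with_barge_in(
    chunks: list[str], interrupt_after: float
) -> tuple[list[str], bool]:
    played: list[str] = []
    playback = asyncio.create_task(speak(chunks, played))
    barge_in = asyncio.create_task(caller_speaks_after(interrupt_after))
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done and not playback.done()
    # Stop whichever task is still running: the rest of the reply is dropped.
    for task in (playback, barge_in):
        task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)
    return played, interrupted


reply = ["Great,", "I can", "offer", "Tuesday", "or Friday."]
played, interrupted = asyncio.run(speak_with_barge_in(reply, interrupt_after=0.25))
print(f"interrupted={interrupted}, chunks played={len(played)} of {len(reply)}")
```

The design choice worth noting is that the unplayed chunks are discarded rather than queued, so the agent yields the floor instead of resuming a stale sentence.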
Turn-taking is the harder half of the problem. The agent has to decide when the caller has actually finished, not just paused to think. End too early and you cut people off. End too late and you blow the 500-millisecond budget. Topcalls tunes this per use case, because a quick "yes" needs a different threshold than someone reading out a long address.
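The trade-off between cutting people off and answering late shows up clearly in a toy endpointing rule: declare the turn over once silence has lasted longer than a threshold. The frame size, the threshold values, and the `end_of_turn_ms` helper are all assumptions for illustration, not Topcalls settings.

```python
from typing import Optional

FRAME_MS = 20  # assumed duration of one voice-activity frame

# Hypothetical silence thresholds per expected answer type, in milliseconds.
SILENCE_THRESHOLD_MS = {"yes_no": 200, "address": 600}


def end_of_turn_ms(vad_frames: list[bool], threshold_ms: int) -> Optional[int]:
    """Return the time at which the turn is declared finished, or None."""
    speech_seen = False
    silent_ms = 0
    for index, is_speech in enumerate(vad_frames):
        if is_speech:
            speech_seen = True
            silent_ms = 0  # any speech resets the silence counter
        elif speech_seen:
            silent_ms += FRAME_MS
            if silent_ms >= threshold_ms:
                return (index + 1) * FRAME_MS
    return None


# 200 ms of speech, a 300 ms thinking pause, 200 ms more speech, then 800 ms of silence.
frames = [True] * 10 + [False] * 15 + [True] * 10 + [False] * 40

print(end_of_turn_ms(frames, SILENCE_THRESHOLD_MS["yes_no"]))   # 400
print(end_of_turn_ms(frames, SILENCE_THRESHOLD_MS["address"]))  # 1300
```

The short threshold fires at 400 ms, in the middle of the caller's pause, and cuts them off. The long threshold waits through the pause and fires at 1300 ms, which is correct here but would feel slow after a quick "yes".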
5. What Makes a Voice Sound Human?
A voice sounds human when two things line up: the audio carries natural prosody, and the agent responds at a human pace. Neural text-to-speech handles the first by adding intonation, varied pacing, emphasis, and breaths instead of reading words flat. The second comes from the sub-500ms loop. A perfect voice that answers two seconds late still sounds like a machine.
The tell is rarely the voice quality on its own. Today’s neural text-to-speech is convincing enough that, in a short blind clip, most people can’t pick it out. The thing that gives a robot away is timing: the dead air before it answers, the way it talks over you, the flat response when you interrupt. Fix the timing and the voice passes.
Topcalls also supports custom voice cloning on Pro and Enterprise plans, so the agent can carry a specific brand voice across every call instead of a generic stock one.
6. What’s Changing: Speech-to-Speech Models
The newest shift is speech-to-speech models that collapse the three-stage pipeline into one. Instead of transcribing audio, sending text to a language model, and re-synthesizing speech, a single model takes audio in and produces audio out directly. That removes two handoffs, each of which used to cost time, and it keeps tone and emotion that text transcription throws away.
OpenAI’s Realtime API is the best-known example of this approach: one model, audio to audio, with native interruption handling. The payoff is lower latency and a voice that can react to how something was said, not only what was said.

The classic STT-LLM-TTS pipeline isn’t going away, though. It’s more controllable. You can read the transcript, log it, run compliance checks on the text, and swap any one model out. Most production systems, including Topcalls, blend both: the precision of a staged pipeline where it counts, the speed of speech-to-speech where it pays off.
7. Where AI Voice Agents Still Break
AI voice agents break in a few predictable places, and an honest answer names them. Heavy background noise and bad phone connections raise the word error rate, which feeds the language model a garbled transcript. Highly emotional or adversarial callers can knock a scripted agent off track. And deep, nuanced negotiation, the kind that hinges on reading a room over months, is still a human job.
- Noisy audio: crosstalk, wind, and weak signal push speech-to-text error rates up and the agent mishears.
- Rapid interruptions: fire several barge-ins in a row and even strong models can stumble on which response to keep.
- Complex, high-ticket sales: enterprise deals that turn on deep relationship nuance belong with a human rep, not an agent.
- Accents and code-switching: rare accents and switching languages mid-sentence can still trip the speech-to-text stage.
Where it does fit: high-volume, repetitive calling where speed and consistency win. Lead qualification, appointment setting, follow-ups, reactivation. The work that burns out human reps is exactly where the technology is strongest.
For those plays, Topcalls handles the volume side: booking and confirming appointments around the clock without a rep dialing a single number.
8. How Does This Run at Scale?
Running this pipeline once is a demo. Running it across thousands of simultaneous calls is the engineering. Each live call holds its own STT-LLM-TTS loop, its own conversation state, and its own sub-500ms budget, all at the same time. Topcalls processes 63,000+ AI calls a day at $0.35 per minute, all-inclusive, with a 99.9% uptime SLA.
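As a rough picture of "each live call holds its own loop and its own state," the sketch below runs many simulated calls concurrently with `asyncio`, each with a separate session object, and caps how many run at once with a semaphore. The names (`CallSession`, `run_call`, `run_calls`) and the concurrency cap are hypothetical. This says nothing about how Topcalls actually schedules calls.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class CallSession:
    """Per-call state: nothing here is shared between calls."""
    call_id: int
    history: list[str] = field(default_factory=list)


async def run_call(session: CallSession, turns: int) -> CallSession:
    for turn in range(turns):
        await asyncio.sleep(0.01)  # stand-in for one STT, LLM, TTS pass
        session.history.append(f"call {session.call_id} turn {turn}")
    return session


async def run_calls(n_calls: int, turns: int, max_concurrent: int) -> list[CallSession]:
    limit = asyncio.Semaphore(max_concurrent)  # cap on simultaneous live calls

    async def guarded(call_id: int) -> CallSession:
        async with limit:
            return await run_call(CallSession(call_id), turns)

    return await asyncio.gather(*(guarded(i) for i in range(n_calls)))


sessions = asyncio.run(run_calls(n_calls=100, turns=3, max_concurrent=25))
total_turns = sum(len(s.history) for s in sessions)
print(f"{len(sessions)} calls finished, {total_turns} turns total")
```

Because each session owns its history, one call's state cannot leak into another, and the semaphore makes the capacity limit explicit instead of leaving it to whatever the hardware tolerates.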
Scale also means the agent doesn’t work in isolation. Mid-call, the orchestration layer reads and writes to your stack, syncing with Salesforce, HubSpot, and your calendar, so a booked meeting lands on a real calendar and a qualified lead updates a real CRM record the moment the call ends.
Want to see the pipeline run on your own use case? Book a strategy call and we’ll walk through what a campaign looks like for your team.



