How we cut voice latency under 300ms, MyChatBot Blog

The latency problem

Voice AI has a brutal latency budget. Anything over 500ms feels like a delay. Over 1 second feels broken. Our v1 pipeline averaged 600ms: workable for a product, but not the bar we wanted to hit.

Latency in voice pipelines is dominated by the chain of services: ASR → LLM → TTS, plus network round trips. In a sequential pipeline, each stage waits for the previous one to finish, so the stage latencies add up. With 200ms ASR + 250ms LLM + 200ms TTS, you're already at 650ms before counting the network.
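To make the arithmetic concrete, here is a minimal sketch of the sequential budget. The stage numbers are the ones from the example above; the function name and the `network_ms` parameter are ours, added for illustration.

```python
# Per-stage latencies from the example above, in milliseconds.
STAGE_LATENCY_MS = {"asr": 200, "llm": 250, "tts": 200}


def sequential_latency_ms(stages: dict[str, int], network_ms: int = 0) -> int:
    """Total latency when each stage waits for the previous one to finish."""
    return sum(stages.values()) + network_ms


print(sequential_latency_ms(STAGE_LATENCY_MS))  # 650
```

Any network overhead lands on top of that sum, which is why a sequential design blows the 500ms budget before the first round trip is counted.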

Streaming partials between every stage

The biggest win came from never waiting for a stage to finish. ASR streams partial transcripts as the user speaks. The LLM gets fed partials and starts generating. TTS streams audio chunks back before the LLM is done.
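The shape of that idea can be sketched with chained async generators, where each stage yields output as soon as it has any input. This is an illustration, not our production code: `asr`, `llm`, and `tts` are toy stand-ins with hypothetical names, and the `events` list exists only to show the order in which stages produce output.

```python
import asyncio
from typing import AsyncIterator

events: list[str] = []  # records the order in which stages produce output


async def asr(frames: list[str]) -> AsyncIterator[str]:
    """Toy ASR: yields a growing partial transcript, one word per audio frame."""
    words: list[str] = []
    for frame in frames:
        await asyncio.sleep(0)  # hand control back, as real I/O would
        words.append(frame)
        partial = " ".join(words)
        events.append(f"asr:{partial}")
        yield partial


async def llm(partials: AsyncIterator[str]) -> AsyncIterator[str]:
    """Toy LLM: emits one token per partial instead of waiting for the end."""
    async for partial in partials:
        token = f"tok{len(partial.split())}"
        events.append(f"llm:{token}")
        yield token


async def tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Toy TTS: turns each token into an audio chunk as soon as it arrives."""
    async for token in tokens:
        events.append(f"tts:{token}")
        yield token.encode()


async def run_pipeline(frames: list[str]) -> list[bytes]:
    return [chunk async for chunk in tts(llm(asr(frames)))]


audio = asyncio.run(run_pipeline(["book", "a", "table"]))
print(events[:3])  # ['asr:book', 'llm:tok1', 'tts:tok1']
```

The first audio chunk is produced after the first ASR partial, not after the full utterance, which is the property that matters for first-token latency.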

This has tradeoffs. ASR partials are noisy: they get revised as more audio comes in, and the LLM has to handle that gracefully. We added a thin layer that buffers the last 150ms of partials before flushing to the LLM, which absorbs most revisions.
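A buffering layer like that can be sketched as follows. Treat the details as assumptions: the class name `PartialBuffer` is hypothetical, and this version uses a fixed window that opens with the first unflushed partial, which may differ from how the real layer times its flushes. Timestamps are passed in explicitly so the behavior is easy to test.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PartialBuffer:
    """Holds ASR partials for a short window so revisions overwrite each other."""

    window_ms: int = 150
    _pending: Optional[str] = field(default=None, init=False, repr=False)
    _window_start_ms: Optional[int] = field(default=None, init=False, repr=False)

    def push(self, partial: str, now_ms: int) -> None:
        """Store the newest partial; a revision replaces the older text."""
        if self._pending is None:
            self._window_start_ms = now_ms  # first partial opens the window
        self._pending = partial

    def poll(self, now_ms: int) -> Optional[str]:
        """Return the buffered partial once the window has elapsed, else None."""
        if self._pending is None or self._window_start_ms is None:
            return None
        if now_ms - self._window_start_ms < self.window_ms:
            return None
        flushed = self._pending
        self._pending = None
        self._window_start_ms = None
        return flushed


buf = PartialBuffer()
buf.push("book a table for to", now_ms=0)
buf.push("book a table for two", now_ms=90)  # revision inside the window
print(buf.poll(now_ms=100))  # None
print(buf.poll(now_ms=150))  # book a table for two
```

The LLM only ever sees the revised text, at the price of up to `window_ms` of added delay on that partial.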

Picking the right model per turn

Not every turn needs a 400B-param model. Most don't. We classify incoming turns into 4 buckets (small-talk, factual lookup, reasoning, and complex multi-step) and route each turn to a model sized for its bucket.

Small-talk goes to a 7B model with 80ms inference. Reasoning goes to a 70B model. Complex multi-step gets the biggest model. The classifier itself is 200M params and runs in under 10ms.
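In code, the routing step boils down to a classifier plus a lookup table. The sketch below is illustrative only: the keyword rules are a toy stand-in for the 200M-param classifier, the model names are hypothetical, and the post does not say which model serves factual lookups, so sending them to the small model is an assumption.

```python
from enum import Enum


class Bucket(Enum):
    SMALL_TALK = "small-talk"
    FACTUAL_LOOKUP = "factual lookup"
    REASONING = "reasoning"
    COMPLEX_MULTI_STEP = "complex multi-step"


# Hypothetical model names for the three tiers.
ROUTES = {
    Bucket.SMALL_TALK: "small-7b",
    Bucket.FACTUAL_LOOKUP: "small-7b",  # assumption, not stated in the post
    Bucket.REASONING: "medium-70b",
    Bucket.COMPLEX_MULTI_STEP: "large",
}


def classify(turn: str) -> Bucket:
    """Toy keyword classifier standing in for the real 200M-param model."""
    text = turn.lower()
    if "and then" in text:
        return Bucket.COMPLEX_MULTI_STEP
    if text.startswith(("why", "how")):
        return Bucket.REASONING
    if text.startswith(("what", "when", "where", "who")):
        return Bucket.FACTUAL_LOOKUP
    return Bucket.SMALL_TALK


def route(turn: str) -> str:
    """Pick the model tier for a turn based on its bucket."""
    return ROUTES[classify(turn)]


print(route("Hi there!"))                            # small-7b
print(route("Why is my bill higher this month?"))    # medium-70b
print(route("Find a flight and then book a hotel"))  # large
```

Keeping the routes in a table means retuning the mix is a config change, not a change to the classifier.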

End result

Median first-token-out latency: 180ms. P95: 280ms. P99: 420ms. Even at P99, latency stays under the 500ms threshold where users start to perceive a delay.
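For readers who want to compute the same summary from their own traces, here is a minimal sketch using the nearest-rank percentile definition. That definition is our choice for the example; the post does not say which percentile method produced the numbers above, and the function names are hypothetical.

```python
import math


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Median, P95, and P99 of a list of first-token latencies."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }
```

Call `summarize` on a list of per-turn first-token latencies in milliseconds. Tail percentiles need a large sample to be meaningful, since P99 over a few hundred turns is decided by a handful of requests.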

Cost dropped too: we use the small model 70% of the time, the medium model 25%, and the big one only 5%.
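The saving follows from a weighted average: the expected cost per turn is each tier's cost multiplied by its share of traffic. A small sketch, using the traffic mix above; the per-turn costs are placeholders for illustration, not real prices.

```python
# Traffic mix from the post: share of turns served by each model tier.
TRAFFIC_MIX = {"small": 0.70, "medium": 0.25, "large": 0.05}

# Placeholder relative costs per turn, NOT real prices.
EXAMPLE_COST = {"small": 1.0, "medium": 10.0, "large": 50.0}


def blended_cost(mix: dict[str, float], cost_per_turn: dict[str, float]) -> float:
    """Expected cost per turn: each tier's cost weighted by its traffic share."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost_per_turn[tier] for tier, share in mix.items())


routed = blended_cost(TRAFFIC_MIX, EXAMPLE_COST)
all_large = blended_cost({"large": 1.0}, EXAMPLE_COST)
print(routed < all_large)  # True
```

Plug in your own per-tier costs to see how sensitive the blended figure is to the share of turns that reach the large model.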

#voice #performance #engineering

Yaroslav Demir

Principal Engineer

Owns platform reliability. 10+ years building high-throughput systems. Will defend Go in any thread.

