The latency problem
Voice AI has a brutal latency budget. Anything over 500ms feels like a delay, and anything over 1 second feels broken. Our v1 pipeline averaged 600ms, which was fine for the product but short of the bar we wanted to hit.
Latency in voice pipelines is dominated by the chain of services: ASR → LLM → TTS, plus network round trips. Each stage waits for the previous to finish. With 200ms ASR + 250ms LLM + 200ms TTS, you're already at 650ms.
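To make the arithmetic concrete, here is a minimal sketch of the sequential budget. The stage timings and the 500ms and 1 second thresholds are the ones quoted above; the names `STAGE_MS`, `sequential_latency_ms`, and `perceived` are made up for illustration.

```python
# Per-stage latencies from the example above, in milliseconds.
STAGE_MS = {"asr": 200, "llm": 250, "tts": 200}


def sequential_latency_ms(stages: dict) -> int:
    """Total latency when each stage waits for the previous one to finish."""
    return sum(stages.values())


def perceived(latency_ms: int) -> str:
    """Map a latency onto the rough perceptual bands quoted in this post."""
    if latency_ms > 1000:
        return "broken"
    if latency_ms > 500:
        return "delayed"
    return "ok"


total = sequential_latency_ms(STAGE_MS)
print(total, perceived(total))  # 650 delayed
```

Network round trips come on top of this, so a purely sequential chain has little room left before it crosses the 500ms line.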
Streaming partials between every stage
The biggest win came from never waiting for a stage to finish. ASR streams partial transcripts as the user speaks. The LLM gets fed partials and starts generating. TTS streams audio chunks back before the LLM is done.
This has tradeoffs. ASR partials are noisy: they get revised as more audio comes in, and the LLM has to handle that gracefully. We added a thin layer that buffers the last 150ms of partials before flushing to the LLM, which absorbs most revisions.
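One way to picture that buffering layer is a small hold window: partials that arrive within 150ms of each other overwrite one another, and only the latest survives to be flushed. This is a sketch under assumptions, not the production layer. The class name `PartialBuffer`, its methods, and the choice to pass timestamps in explicitly are all hypothetical.

```python
from typing import Optional


class PartialBuffer:
    """Hold ASR partials for a short window so revisions overwrite each other."""

    def __init__(self, hold_ms: int = 150) -> None:
        self.hold_ms = hold_ms
        self._pending: Optional[str] = None
        self._window_start_ms: Optional[int] = None
        self._last_flushed: Optional[str] = None

    def push(self, text: str, now_ms: int) -> None:
        """Record a new partial; a revision replaces whatever is pending."""
        if self._pending is None:
            self._window_start_ms = now_ms  # first partial opens the window
        self._pending = text

    def poll(self, now_ms: int) -> Optional[str]:
        """Return the latest partial once the hold window has elapsed."""
        if self._pending is None or self._window_start_ms is None:
            return None
        if now_ms - self._window_start_ms < self.hold_ms:
            return None  # still inside the window, keep absorbing revisions
        text = self._pending
        self._pending = None
        self._window_start_ms = None
        if text == self._last_flushed:
            return None  # nothing changed since the last flush
        self._last_flushed = text
        return text


buf = PartialBuffer(hold_ms=150)
buf.push("book a fight", now_ms=0)
buf.push("book a flight", now_ms=60)  # revision arrives inside the window
print(buf.poll(now_ms=100))  # None
print(buf.poll(now_ms=150))  # book a flight
```

The LLM never sees the misheard "fight" in this run, at the price of up to 150ms of added delay on each flush.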
Picking the right model per turn
Not every turn needs a 400B-param model. Most don't. We classify incoming turns into 4 buckets (small-talk, factual lookup, reasoning, complex multi-step) and route each to a model sized for it.
Small-talk goes to a 7B model with 80ms inference. Reasoning goes to a 70B model. Complex multi-step gets the big stuff. Our classifier itself is 200M params and runs in under 10ms.
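The routing step can be sketched as a lookup table keyed by bucket. Everything named here is hypothetical: the model labels, the `classify` keyword heuristic (a toy stand-in for the 200M-param classifier), and in particular the factual-lookup route, which is not specified above and is filled with a placeholder.

```python
from enum import Enum


class Bucket(Enum):
    SMALL_TALK = "small-talk"
    FACTUAL_LOOKUP = "factual lookup"
    REASONING = "reasoning"
    COMPLEX_MULTI_STEP = "complex multi-step"


# Hypothetical model labels. Only three of these routes are stated in the post.
ROUTES = {
    Bucket.SMALL_TALK: "small-7b",
    Bucket.FACTUAL_LOOKUP: "medium-70b",  # ASSUMPTION: placeholder, not from the post
    Bucket.REASONING: "medium-70b",
    Bucket.COMPLEX_MULTI_STEP: "large",
}


def classify(turn: str) -> Bucket:
    """Toy keyword heuristic standing in for a learned classifier."""
    text = turn.lower()
    if " and then " in text:
        return Bucket.COMPLEX_MULTI_STEP
    if text.startswith(("why", "how")):
        return Bucket.REASONING
    if text.startswith(("what", "when", "where", "who")):
        return Bucket.FACTUAL_LOOKUP
    return Bucket.SMALL_TALK


def route(turn: str) -> str:
    """Pick a model label for a turn based on its bucket."""
    return ROUTES[classify(turn)]


print(route("hey, nice to hear you again"))          # small-7b
print(route("Why is my flight more expensive now?"))  # medium-70b
print(route("Find a flight and then book a hotel"))   # large
```

Keeping the table separate from the classifier means the routes can be re-tuned without retraining anything.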
End result
Median first-token-out latency: 180ms. P95: 280ms. P99: 420ms. Even at P99, responses land under the 500ms point where a delay becomes noticeable for most users.
Cost dropped too: we use the small model 70% of the time, the medium model 25%, and the big one only 5%.
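To see why that mix matters, the blended cost per turn is just a traffic-weighted average. The shares below are the ones quoted above; the relative per-turn costs are invented placeholders (no real prices appear in this post), so only the shape of the calculation should be taken from this sketch.

```python
import math

# Traffic shares quoted in the post.
SHARES = {"small": 0.70, "medium": 0.25, "big": 0.05}

# ASSUMPTION: hypothetical relative costs per turn, for illustration only.
RELATIVE_COST = {"small": 1.0, "medium": 10.0, "big": 50.0}


def blended_cost(shares: dict, unit_cost: dict) -> float:
    """Traffic-weighted average cost per turn."""
    if not math.isclose(sum(shares.values()), 1.0):
        raise ValueError("traffic shares must sum to 1")
    return sum(share * unit_cost[tier] for tier, share in shares.items())


mixed = blended_cost(SHARES, RELATIVE_COST)
all_big = RELATIVE_COST["big"]
print(f"blended cost is {mixed / all_big:.0%} of sending every turn to the big model")
```

Swapping in real per-tier costs changes the percentage, but the small model's traffic share dominates the result either way.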