The benchmark
5,000 voice agent utterances drawn from real customer conversations (anonymized). Three task types: small talk, structured information gathering, multi-step reasoning. Models compared: GPT-4o, Claude 3.5 Sonnet, our in-house 13B, our in-house 70B, and a popular open-source 70B.
Findings
On small talk, all models are basically equal. Quality differences are noise. Latency and cost dominate the choice.
On structured information gathering, Claude and our 13B win. They follow instructions more reliably.
On multi-step reasoning, GPT-4o and our 70B lead. Significant gap to other models.
What this means
There's no "best" model, there's the best model for each turn. Routing per-turn beats picking one model for everything, by a lot.
Our production system routes ~70% of turns to small fast models, ~25% to medium, ~5% to the largest. Quality is held; cost drops significantly.