All posts
AI Research 10 min read

Comparing GPT-4, Claude and our in-house model for voice

We benchmarked five models on a 5,000-utterance voice agent task. Here's what we found about quality, latency and cost tradeoffs.

Anna Roman
Lead Researcher
Mar 20, 2026

The benchmark

5,000 voice agent utterances drawn from real customer conversations (anonymized). Three task types: small talk, structured information gathering, multi-step reasoning. Models compared: GPT-4o, Claude 3.5 Sonnet, our in-house 13B, our in-house 70B, and a popular open-source 70B.

Findings

On small talk, all models are basically equal. Quality differences are noise. Latency and cost dominate the choice.

On structured information gathering, Claude and our 13B win. They follow instructions more reliably.

On multi-step reasoning, GPT-4o and our 70B lead. Significant gap to other models.

Benchmark results bar chart
Benchmark results bar chart

What this means

There's no "best" model, there's the best model for each turn. Routing per-turn beats picking one model for everything, by a lot.

Our production system routes ~70% of turns to small fast models, ~25% to medium, ~5% to the largest. Quality is held; cost drops significantly.

#ml#voice#benchmarks
Anna Roman
Lead Researcher

Leads our applied ML research. Published widely on multi-agent systems. Believes good evals are 80% of good AI.

Try MyChatBot for free

Set up your first AI agent in 10 minutes. No credit card required.

Start free trial