Comparing GPT-4, Claude and our in-house model for…, MyChatBot Blog

The benchmark

The benchmark is 5,000 voice-agent utterances drawn from real, anonymized customer conversations. They cover three task types: small talk, structured information gathering, and multi-step reasoning. We compared five models: GPT-4o, Claude 3.5 Sonnet, our in-house 13B, our in-house 70B, and a popular open-source 70B.

Findings

On small talk, all five models perform about equally. The quality differences we measured are within noise, so latency and cost should drive the choice.
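One way to check whether a quality gap is "noise" is a paired bootstrap over per-utterance scores. The sketch below is only an illustration under assumptions: the post does not say which statistical test was used, and the function names, the 95% interval, and the idea of a numeric per-utterance score are all hypothetical.

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a - b).

    scores_a and scores_b are per-utterance quality scores for two models
    on the SAME utterances (hypothetical scoring; not from the post).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample utterances with replacement and record the mean difference.
        sample = rng.choices(diffs, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

def looks_like_noise(scores_a, scores_b, **kwargs):
    """True when the confidence interval for the difference includes zero."""
    lower, upper = bootstrap_diff_ci(scores_a, scores_b, **kwargs)
    return lower <= 0.0 <= upper
```

If the interval straddles zero for every pair of models on a task, treating quality as a tie and choosing on latency and cost is a reasonable reading of the data.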

On structured information gathering, Claude 3.5 Sonnet and our in-house 13B lead. They follow instructions more reliably than the other models.

On multi-step reasoning, GPT-4o and our in-house 70B lead, with a significant gap to the other models.

What this means

There's no single "best" model; there's a best model for each turn. Routing per turn beats picking one model for everything, by a lot.
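To make per-turn routing concrete, here is a minimal sketch. It is not our production router: the keyword classifier, the tier labels, and the mapping from task type to tier are all assumptions made for illustration, and a real system would likely use a trained classifier instead.

```python
import re

# Hypothetical mapping from task type to model tier (an assumption, not
# something the benchmark itself specifies).
ROUTES = {
    "small_talk": "small-fast",
    "structured_gathering": "medium",
    "multi_step_reasoning": "largest",
}

# Toy keyword sets, chosen only to make the example runnable.
REASONING_WORDS = {"why", "if", "because", "compare"}
STRUCTURED_WORDS = {"address", "email", "account"}

def classify_turn(utterance: str) -> str:
    """Toy classifier: whole-word keyword matching on the utterance."""
    words = set(re.findall(r"[a-z0-9']+", utterance.lower()))
    if words & REASONING_WORDS:
        return "multi_step_reasoning"
    if words & STRUCTURED_WORDS or any(w.isdigit() for w in words):
        return "structured_gathering"
    return "small_talk"

def route(utterance: str) -> str:
    """Pick a model tier for a single turn."""
    return ROUTES[classify_turn(utterance)]

if __name__ == "__main__":
    for turn in (
        "hi there, how are you?",
        "my account number is 48213",
        "if i cancel now, why would i still be charged?",
    ):
        print(f"{turn!r} -> {route(turn)}")
```

The point of the design is that the routing decision is made per turn, so one conversation can touch several models as the task type changes.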

Our production system routes ~70% of turns to small, fast models, ~25% to medium-sized models, and ~5% to the largest. Quality holds while cost drops significantly.
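The cost effect of that traffic mix is a weighted average. In the sketch below, the traffic shares come from the paragraph above, but the per-turn costs are made-up placeholders in relative units, not our actual prices, so the output only shows how the arithmetic works.

```python
# Traffic shares as stated in the post (approximate).
TRAFFIC_SHARE = {"small": 0.70, "medium": 0.25, "largest": 0.05}

# Hypothetical relative cost per turn for each tier (placeholders only).
COST_PER_TURN = {"small": 1.0, "medium": 4.0, "largest": 20.0}

def blended_cost(share: dict, cost: dict) -> float:
    """Expected cost per turn under a given routing mix."""
    if abs(sum(share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share[tier] * cost[tier] for tier in share)

def savings_vs_single_model(share: dict, cost: dict, baseline_tier: str) -> float:
    """Fractional saving compared with sending every turn to one tier."""
    return 1.0 - blended_cost(share, cost) / cost[baseline_tier]

if __name__ == "__main__":
    routed = blended_cost(TRAFFIC_SHARE, COST_PER_TURN)
    saved = savings_vs_single_model(TRAFFIC_SHARE, COST_PER_TURN, "largest")
    print(f"blended cost per turn: {routed:.2f} (relative units)")
    print(f"saving vs. all-largest: {saved:.1%}")
```

Swapping in real per-turn costs for each tier gives the actual saving; the larger the price gap between tiers, the more the 70/25/5 split pays off.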

#ml #voice #benchmarks

Anna Roman

Lead Researcher

Leads our applied ML research. Published widely on multi-agent systems. Believes good evals are 80% of good AI.
