Prompt engineering at production scale, MyChatBot Blog

Evals are 80% of prompt engineering

Most teams write a prompt and ship it. We don't ship a prompt change without first running it against a 200-case eval suite. Since we built the suite, production incidents caused by prompt changes have dropped to near zero.
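A gate like this can be sketched in a few lines. This is a minimal illustration, not a description of any particular suite: `EvalCase`, `run_suite`, `gate`, the `run_prompt` callable (a stand-in for the real model call), and the 0.95 pass-rate threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    name: str
    user_input: str
    # Returns True when the model output is acceptable for this case.
    check: Callable[[str], bool]


def run_suite(
    prompt: str,
    cases: List[EvalCase],
    run_prompt: Callable[[str, str], str],
) -> List[str]:
    """Run every case against the prompt and return the names of failing cases."""
    failures = []
    for case in cases:
        output = run_prompt(prompt, case.user_input)
        if not case.check(output):
            failures.append(case.name)
    return failures


def gate(
    prompt: str,
    cases: List[EvalCase],
    run_prompt: Callable[[str, str], str],
    min_pass_rate: float = 0.95,  # assumed threshold, tune per product
) -> bool:
    """Return True only if the prompt clears the required pass rate."""
    if not cases:
        raise ValueError("eval suite is empty")
    failures = run_suite(prompt, cases, run_prompt)
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate >= min_pass_rate
```

In CI, a check along these lines would call `gate` with the candidate prompt and fail the build when it returns False, so a prompt change cannot merge without clearing the suite.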

Treat prompts like code

Every prompt is in git. Every change goes through PR review. Every change has a revert path. This sounds obvious; many teams don't do it.

The rule that saved us

If a prompt change improves quality on average but makes the worst case worse, don't ship it.

Average quality matters, but the experience customers remember is the worst-case interaction. We optimize for tail performance, not the mean. This rule has caught many shiny prompt changes that would have made us look great in evals and worse in real life.
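The rule can be written down as a small check over per-case scores. The sketch below assumes each eval case yields a numeric score where higher is better, and that the baseline and candidate prompts were scored on the same cases. The function name `should_ship` and the `tolerance` parameter are hypothetical, introduced only for this example.

```python
from statistics import mean
from typing import Dict


def should_ship(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,  # assumed slack allowed on the worst case
) -> bool:
    """Ship only if the mean does not drop and the worst case does not regress."""
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same cases")
    if not baseline:
        raise ValueError("no scores to compare")
    mean_ok = mean(candidate.values()) >= mean(baseline.values())
    # The tail check: the candidate's lowest score must not fall below the baseline's.
    worst_ok = min(candidate.values()) >= min(baseline.values()) - tolerance
    return mean_ok and worst_ok


baseline = {"refund": 0.8, "billing": 0.7, "angry_user": 0.6}
shiny = {"refund": 0.95, "billing": 0.9, "angry_user": 0.4}   # better mean, worse tail
steady = {"refund": 0.85, "billing": 0.75, "angry_user": 0.6}  # better mean, same tail

print(should_ship(baseline, shiny))   # False
print(should_ship(baseline, steady))  # True
```

The `shiny` candidate has the higher average yet is rejected because its lowest score falls below the baseline's lowest score, which is exactly the case the rule is meant to catch.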

#prompts #ml #production

Anna Roman

Lead Researcher

Leads our applied ML research. Published widely on multi-agent systems. Believes good evals are 80% of good AI.
