AI Research 8 min read

Prompt engineering at production scale

Tactics for prompts that hold up across millions of conversations: evals, version control, and the one rule that has spared us many incidents.

Anna Roman
Lead Researcher
Feb 28, 2026

Evals are 80% of prompt engineering

Most teams write a prompt and ship it. We don't ship a prompt change without first running it against a 200-case eval suite. Since we built that suite, production incidents caused by prompt changes have dropped to near zero.
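As a rough illustration of what such a gate can look like, here is a minimal sketch with two cases instead of 200. Everything in it is hypothetical: `EvalCase`, `run_suite`, and `fake_model` are names made up for this example, and `fake_model` stands in for a real model call made with the candidate prompt.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    user_input: str
    check: Callable[[str], bool]  # returns True when the reply is acceptable


def run_suite(generate: Callable[[str], str], cases: list[EvalCase]) -> list[str]:
    """Return the names of the cases whose reply fails its check."""
    return [c.name for c in cases if not c.check(generate(c.user_input))]


def fake_model(user_input: str) -> str:
    # Hypothetical stand-in for calling the model with the candidate prompt.
    if "refund" in user_input:
        return "I can help with your refund."
    return "Sorry, I don't know."


cases = [
    EvalCase("refund_request", "I want a refund", lambda r: "refund" in r.lower()),
    EvalCase("no_apology_opener", "What are your hours?", lambda r: not r.startswith("Sorry")),
]

failures = run_suite(fake_model, cases)
ship_allowed = not failures  # any failing case blocks the change
print(failures, ship_allowed)
```

In practice a check like this would run in CI on every prompt change, so a failing case blocks the merge rather than surfacing in production.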

Treat prompts like code

Every prompt is in git. Every change goes through PR review. Every change has a revert path. This sounds obvious; many teams don't do it.
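One way to make the revert path concrete, sketched under assumptions: keep each prompt as a plain file in the repository and log a short content hash alongside every response, so an incident can be traced to an exact prompt version. The file name `support_agent.txt` and the helper `load_prompt` are invented for this example, and a temporary directory stands in for the repository checkout.

```python
import hashlib
import tempfile
from pathlib import Path


def load_prompt(path: Path) -> tuple[str, str]:
    """Read a prompt file and return (text, short content hash)."""
    text = path.read_text(encoding="utf-8")
    version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return text, version


# A temporary directory stands in for the git checkout in this sketch.
with tempfile.TemporaryDirectory() as repo:
    prompt_file = Path(repo) / "support_agent.txt"  # hypothetical file name
    prompt_file.write_text("You are a concise support agent.\n", encoding="utf-8")
    text, version = load_prompt(prompt_file)
    _, same_version = load_prompt(prompt_file)

    # Editing the prompt changes the hash, so logs can tell versions apart.
    prompt_file.write_text("You are a friendly support agent.\n", encoding="utf-8")
    _, changed_version = load_prompt(prompt_file)

print(version, changed_version)
```

The hash complements git rather than replacing it: git holds the history and the revert, while the hash ties a logged conversation back to the prompt text that produced it.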

The rule that saved us

If a prompt change improves quality on average but makes the worst case worse, don't ship it.

Average quality is fine, but the experience customers remember is the worst-case interaction. We optimize for tail performance, not the mean. This rule has caught many shiny prompt changes that would have made us look great on average in evals and worse in real life.
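The rule can be written down as a small check. This is a sketch under assumptions, not our actual tooling: it assumes each eval case yields a numeric score where higher is better, and the function names, the 5% tail fraction, and the sample scores are all made up for illustration.

```python
from statistics import mean


def tail_mean(scores: list[float], tail_fraction: float = 0.05) -> float:
    """Mean of the worst-scoring fraction of cases (at least one case)."""
    k = max(1, int(len(scores) * tail_fraction))
    return mean(sorted(scores)[:k])


def should_ship(baseline: list[float], candidate: list[float],
                tail_fraction: float = 0.05) -> bool:
    """Block the change if the worst case or the tail gets worse,
    no matter how much the average improves."""
    worst_ok = min(candidate) >= min(baseline)
    tail_ok = tail_mean(candidate, tail_fraction) >= tail_mean(baseline, tail_fraction)
    return worst_ok and tail_ok


# Illustrative per-case scores, not real eval results.
baseline = [0.70, 0.72, 0.75, 0.80, 0.78]
shiny = [0.95, 0.96, 0.94, 0.97, 0.40]  # higher mean, much worse worst case
safe = [0.74, 0.75, 0.78, 0.82, 0.80]   # modest gain, no regression in the tail

print(should_ship(baseline, shiny))
print(should_ship(baseline, safe))
```

The `shiny` candidate has the better average yet is rejected because its worst case drops, which is exactly the kind of change the rule is meant to catch.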

#prompts #ml #production
Anna Roman
Lead Researcher

Leads our applied ML research. Published widely on multi-agent systems. Believes good evals are 80% of good AI.
