Evals are 80% of prompt engineering
Most teams write prompts and ship. We don't ship a prompt change without running it against a 200-case eval suite first. Production incidents from prompt changes have dropped to near zero since we built this.
Treat prompts like code
Every prompt is in git. Every change goes through PR review. Every change has a revert path. This sounds obvious; many teams don't do it.
The rule that saved us
If a prompt change improves quality on average but makes the worst case worse, don't ship it.
Average quality is fine but the experience customers remember is the worst-case interaction. We optimize for tail performance, not mean. This rule has caught many shiny prompt changes that would have made us look great in evals and worse in real life.