Where we are
MyChatBot processes 50M+ messages per month across WhatsApp, Telegram, Instagram, voice, and email, up from 1M per month two years ago. Here's how our thinking changed as we scaled.
1. Tail latency is the only latency that matters
We spent the first year optimizing median latency. The median is fine. Customers feel P95 and P99; that's where the bad experiences live. Now every dashboard we look at shows P99 first.
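To make the gap between median and tail concrete, here is a minimal sketch that computes percentiles with the nearest-rank method. The latency numbers are invented for illustration and are not our production data; the `percentile` helper is hypothetical, not something from our codebase.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up data: 98 requests take 100 ms, 2 requests take 4 seconds.
latencies_ms = [100] * 98 + [4000] * 2

summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
print(summary)
```

In this made-up sample, P50 and P95 both come out at 100 ms while P99 is 4000 ms, so a dashboard that leads with the median would hide the slow requests entirely.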
2. Idempotency or you'll cry
Every message handler must be idempotent: processing the same message twice must have the same effect as processing it once. Networks fail. Webhooks retry. If your handler is not idempotent, you'll send duplicate messages to real customers. We learned this the embarrassing way.
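One common way to get there is to deduplicate on a message ID before doing anything with side effects. The sketch below is an illustration under assumptions, not our actual handler: the `IdempotentHandler` class, the `message_id` field, and the in-memory set are all hypothetical, and a real deployment would likely keep seen IDs in a shared store with an expiry.

```python
class IdempotentHandler:
    """Wraps a side-effecting send function so that a retried delivery
    of the same message ID does not trigger the side effect twice."""

    def __init__(self, send):
        self._send = send
        self._seen = set()  # assumption: in-memory; use a shared store in practice

    def handle(self, message_id, payload):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False  # webhook retry or redelivery: do nothing
        self._send(payload)
        # Recorded only after a successful send, so a failed send can be retried.
        self._seen.add(message_id)
        return True


outbox = []
handler = IdempotentHandler(outbox.append)
first = handler.handle("msg-1", "Your order has shipped")
retry = handler.handle("msg-1", "Your order has shipped")  # simulated webhook retry
print(first, retry, outbox)
```

Recording the ID after the send means a crash between the two steps can still produce a duplicate; closing that window needs the send and the record to be atomic, or the downstream API to accept an idempotency key of its own.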
3. Backpressure beats rate limits
Rate limits feel safe, but they push the problem upstream, usually to your customers, who then complain. Backpressure (slowing down gracefully when downstream is slow) keeps the system stable without surprising anyone.
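The simplest form of backpressure is a bounded buffer between producer and consumer: when the buffer is full, the producer waits instead of being rejected. Here is a minimal sketch using the standard library's `queue.Queue`. The `run_pipeline` function and its parameters are hypothetical names for illustration, not a description of our pipeline.

```python
import queue
import threading
import time


def run_pipeline(items, max_in_flight=2, work_seconds=0.01):
    """Push items through a bounded queue to one slow consumer.

    Returns the processed items and the deepest queue depth observed."""
    buffer = queue.Queue(maxsize=max_in_flight)
    processed = []
    peak_depth = 0

    def consumer():
        while True:
            item = buffer.get()
            if item is None:  # sentinel: producer is done
                break
            time.sleep(work_seconds)  # simulate a slow downstream
            processed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for item in items:
        # put() blocks while the queue is full, so the producer slows to
        # the consumer's pace instead of dropping or rejecting work.
        buffer.put(item)
        peak_depth = max(peak_depth, buffer.qsize())
    buffer.put(None)
    worker.join()
    return processed, peak_depth


processed, peak_depth = run_pipeline(list(range(10)))
print(processed, peak_depth)
```

Nothing is dropped and the queue never grows past `max_in_flight`; the cost is that the producer takes longer, which is the "slowing down gracefully" part.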
4. Logs are not metrics
Don't try to derive metrics from logs at scale. Build first-class metrics. Logs are for debugging individual cases; metrics are for understanding the system.
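"First-class" here means the code records a number at the moment something happens, rather than writing a log line that gets parsed later. The sketch below hand-rolls a counter and a bucketed latency histogram to show the shape of it. The names (`LatencyHistogram`, `record_message`, the bucket bounds) are made up for illustration; in practice you would use a metrics library rather than this.

```python
from bisect import bisect_left
from collections import Counter


class LatencyHistogram:
    """Fixed-bucket histogram. Bucket i counts values <= bounds_ms[i]
    (and above the previous bound); the last bucket is overflow."""

    def __init__(self, bounds_ms):
        self.bounds_ms = sorted(bounds_ms)
        self.counts = [0] * (len(self.bounds_ms) + 1)
        self.total = 0

    def observe(self, value_ms):
        self.counts[bisect_left(self.bounds_ms, value_ms)] += 1
        self.total += 1


messages_by_channel = Counter()
latency = LatencyHistogram([100, 500, 1000])  # hypothetical bucket bounds in ms


def record_message(channel, latency_ms):
    """Called once per handled message: constant memory, no log parsing."""
    messages_by_channel[channel] += 1
    latency.observe(latency_ms)


for channel, ms in [("whatsapp", 80), ("telegram", 100), ("email", 450), ("voice", 2500)]:
    record_message(channel, ms)

print(dict(messages_by_channel), latency.counts)
```

The memory cost is fixed by the number of buckets and channels, not by message volume, which is what makes this workable at a scale where parsing every log line is not.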
5. Pager hygiene is engineering culture
If your team is paged at night for things that aren't actionable, you'll burn them out. Every pageable alert must have a runbook and a clear action. If either is missing, the alert shouldn't page.
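A rule like this is easy to enforce mechanically, for example as a check in CI over your alert definitions. The sketch below assumes a made-up alert format (dicts with `name`, `pages`, `runbook`, and `action` keys); it is not the schema of any particular alerting tool.

```python
def paging_violations(alerts):
    """Return (alert name, missing fields) for every paging alert that
    lacks a runbook or a clear action. Non-paging alerts are ignored."""
    problems = []
    for alert in alerts:
        if not alert.get("pages"):
            continue
        missing = [field for field in ("runbook", "action") if not alert.get(field)]
        if missing:
            problems.append((alert["name"], missing))
    return problems


# Hypothetical alert definitions, for illustration only.
alerts = [
    {"name": "queue-depth-high", "pages": True,
     "runbook": "runbooks/queue-depth.txt", "action": "Scale up consumers"},
    {"name": "disk-usage-warning", "pages": True, "runbook": "", "action": ""},
    {"name": "cache-hit-ratio-low", "pages": False},
]

violations = paging_violations(alerts)
print(violations)
```

Anything this check returns either gets a runbook and an action, or gets downgraded to a non-paging notification.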