e.g. disk full is a cause of the app being down, not a symptom; you do need to fix it but because of the user impact, not because disks inherently need to be empty
#2 Keep the system simple, and log lines easy to read
only the minimum detail needed to be useful
#3 Consoles are :key:
good graphs need labels and follow other good practices
no more than 5 graphs per console (dashboard), and no more than 5 plots/lines per graph (from Monitoring with Prometheus)
#4 Make it easy to figure out which component is at fault
#5 Create process to address & resolve alerts
escalation procedure
what is the appropriate chain of alerting
Discovering points of failure
Console graphs? not particularly effective
Latency? somewhat effective
raw? hard to see average/impact
average? useless, doesn't move much
p99 instead, & p100
heatmap of latency
latency of failures vs. successes is important; erroring early is not better, it's still erroring (see latency sketch below)
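A minimal sketch of recording latency split by outcome with the Python prometheus_client library; the metric name request_latency_seconds, the outcome label values, and the port are illustrative choices, not anything prescribed above.

```python
# Sketch: a latency histogram labeled by outcome, so failure latency can be
# graphed separately from success latency. Names here are illustrative.
import random
import time

from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "request_latency_seconds",
    "Request latency in seconds, labeled by outcome",
    ["outcome"],  # "success" or "failure"
)

def handle_request():
    start = time.time()
    outcome = "failure"
    try:
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
        outcome = "success"
    finally:
        # Observe the duration whether the request succeeded or errored out.
        REQUEST_LATENCY.labels(outcome=outcome).observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    while True:
        handle_request()
```

From a histogram like this, the p99 per outcome would come from something like `histogram_quantile(0.99, sum by (le, outcome) (rate(request_latency_seconds_bucket[5m])))` rather than an average, and a latency heatmap can be built from the same buckets.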
error rates
good for rare response codes
not particularly good for other patterns
an unusual request size may also indicate an error (see sketch below)
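One common way to instrument this: a counter labeled by response code, so rare codes show up as their own series; the metric name http_requests_total and the code label are illustrative choices.

```python
# Sketch: count requests per response code so rare codes (e.g. a handful of
# 502s) are visible as their own series rather than lost in an overall rate.
from prometheus_client import Counter

HTTP_REQUESTS = Counter(
    "http_requests_total",
    "HTTP requests, labeled by response code",
    ["code"],
)

def record_response(status_code: int) -> None:
    HTTP_REQUESTS.labels(code=str(status_code)).inc()
```

An overall error ratio would then be roughly `sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`.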
Traffic demand
requests per second
not useful for new issues
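A rough sketch of reading traffic demand back out of Prometheus, assuming a server at localhost:9090 and the hypothetical http_requests_total counter from above; comparing against last week gives a sense of whether demand is unusual, though as noted it won't reveal new kinds of issues.

```python
# Sketch: query requests-per-second from the Prometheus HTTP API.
# The server URL and metric name are assumptions for illustration.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"

def instant_query(promql: str) -> list:
    """Run an instant query and return the result vector."""
    url = PROMETHEUS_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["data"]["result"]

# Current traffic demand over the last five minutes.
rps_now = instant_query("sum(rate(http_requests_total[5m]))")
# The same rate a week ago, as a rough baseline.
rps_last_week = instant_query("sum(rate(http_requests_total[5m] offset 1w))")
```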
saturation? how full the service is
users vs. CPU/mem/etc
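Saturation is usually exposed as gauges; a small sketch below, where the metric names and the idea of a fixed worker pool are illustrative assumptions.

```python
# Sketch: gauges for "how full are we?" -- active users alongside a
# utilization ratio, so user load can be plotted against resource fullness.
from prometheus_client import Gauge

ACTIVE_USERS = Gauge("app_active_users", "Currently active user sessions")
POOL_UTILIZATION = Gauge(
    "app_worker_pool_utilization_ratio",
    "Fraction of the worker pool in use (1.0 = fully saturated)",
)

def update_saturation(active_sessions: int, busy_workers: int, pool_size: int) -> None:
    ACTIVE_USERS.set(active_sessions)
    POOL_UTILIZATION.set(busy_workers / pool_size)
```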
evaluate trends in your alerts
outages may need more detailed evaluation of alerts
user metrics (Google Analytics / New Relic / APM)
monitor the monitoring
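One simple form of meta-monitoring is an external watchdog, run from outside the monitoring stack, that checks Prometheus's own health endpoint; the URL and the idea of paging through a separate channel are assumptions in this sketch.

```python
# Sketch: an external watchdog that checks whether Prometheus itself is up.
# It should run (and alert) through a path that does not depend on Prometheus.
import sys
import urllib.request

def prometheus_is_healthy(url: str = "http://localhost:9090/-/healthy") -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    if not prometheus_is_healthy():
        print("prometheus health check failed", file=sys.stderr)
        sys.exit(1)  # hook this exit code into an independent paging channel
```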
batch job completions
allow 2 missed runs before alerting; if waiting that long is unacceptable, run the job more often until 2 missed runs is an acceptable delay (see Pushgateway sketch below)
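For batch jobs, a common Prometheus pattern is to push a last-success timestamp to the Pushgateway and alert when it goes stale; the job name, metric name, and gateway address below are illustrative.

```python
# Sketch: record a batch job's last successful completion via the Pushgateway,
# so an alert can fire when too much time passes without a success.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def report_success(job_name: str = "nightly_export",
                   gateway: str = "localhost:9091") -> None:
    registry = CollectorRegistry()
    last_success = Gauge(
        "job_last_success_timestamp_seconds",
        "Unix time of the last successful run of this batch job",
        registry=registry,
    )
    last_success.set_to_current_time()
    push_to_gateway(gateway, job=job_name, registry=registry)
```

For a daily job, the "two missed runs" rule above would translate to an alert expression along the lines of `time() - job_last_success_timestamp_seconds > 2 * 86400`.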