e.g. disk full is a cause of the app being down, not a symptom; you do need to fix it but because of the user impact, not because disks inherently need to be empty
#2 Keep the system simple, and log lines easy to read
only the minimum detail needed to be useful
#3 Consoles are :key:
good graphs need labels and follow other good practices
no more than 5 graphs per console (dashboard), and no more than 5 plots/lines per graph (from Monitoring with Prometheus)
#4 Make it easy to figure out which component is at fault
#5 Create process to address & resolve alerts
escalation procedure
what is the appropriate chain of alerting
Discovering points of failure
Console graphs? not particularly effective
Latency? somewhat effective
raw? hard to see average/impact
average? useless, doesn't move much
p99 instead, & p100
heatmap of latency
latency of failures vs. successes is important; erroring early is not better, it's still erroring (see latency sketch below)
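A minimal sketch of recording latency split by outcome with the Python prometheus_client library; the metric name request_latency_seconds, the outcome label values, and the port are illustrative choices, not anything prescribed above.

```python
# Sketch: a latency histogram labeled by outcome, so failure latency can be
# graphed separately from success latency. Names here are illustrative.
import random
import time

from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "request_latency_seconds",
    "Request latency in seconds, labeled by outcome",
    ["outcome"],  # "success" or "failure"
)

def handle_request():
    start = time.time()
    outcome = "failure"
    try:
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
        outcome = "success"
    finally:
        # Observe the duration whether the request succeeded or errored out.
        REQUEST_LATENCY.labels(outcome=outcome).observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for Prometheus to scrape
    while True:
        handle_request()
```

From a histogram like this, the p99 per outcome would come from something like `histogram_quantile(0.99, sum by (le, outcome) (rate(request_latency_seconds_bucket[5m])))` rather than an average, and a latency heatmap can be built from the same buckets.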
error rates
good for rare response codes
not particularly good for other patterns
an unusual request size may also indicate an error (see sketch below)
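One common way to instrument this: a counter labeled by response code, so rare codes show up as their own series; the metric name http_requests_total and the code label are illustrative choices.

```python
# Sketch: count requests per response code so rare codes (e.g. a handful of
# 502s) are visible as their own series rather than lost in an overall rate.
from prometheus_client import Counter

HTTP_REQUESTS = Counter(
    "http_requests_total",
    "HTTP requests, labeled by response code",
    ["code"],
)

def record_response(status_code: int) -> None:
    HTTP_REQUESTS.labels(code=str(status_code)).inc()
```

An overall error ratio would then be roughly `sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`.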
Traffic demand
requests per second
not useful for new issues
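A rough sketch of reading traffic demand back out of Prometheus, assuming a server at localhost:9090 and the hypothetical http_requests_total counter from above; comparing against last week gives a sense of whether demand is unusual, though as noted it won't reveal new kinds of issues.

```python
# Sketch: query requests-per-second from the Prometheus HTTP API.
# The server URL and metric name are assumptions for illustration.
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"

def instant_query(promql: str) -> list:
    """Run an instant query and return the result vector."""
    url = PROMETHEUS_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["data"]["result"]

# Current traffic demand over the last five minutes.
rps_now = instant_query("sum(rate(http_requests_total[5m]))")
# The same rate a week ago, as a rough baseline.
rps_last_week = instant_query("sum(rate(http_requests_total[5m] offset 1w))")
```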
saturation? how full the service is
users vs. CPU/mem/etc
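Saturation is usually exposed as gauges; a small sketch below, where the metric names and the idea of a fixed worker pool are illustrative assumptions.

```python
# Sketch: gauges for "how full are we?" -- active users alongside a
# utilization ratio, so user load can be plotted against resource fullness.
from prometheus_client import Gauge

ACTIVE_USERS = Gauge("app_active_users", "Currently active user sessions")
POOL_UTILIZATION = Gauge(
    "app_worker_pool_utilization_ratio",
    "Fraction of the worker pool in use (1.0 = fully saturated)",
)

def update_saturation(active_sessions: int, busy_workers: int, pool_size: int) -> None:
    ACTIVE_USERS.set(active_sessions)
    POOL_UTILIZATION.set(busy_workers / pool_size)
```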
evaluate trends in your alerts
outages may need more detailed evaluation of alerts
user metrics (Google Analytics / New Relic / APM)
monitor the monitoring
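One simple form of meta-monitoring is an external watchdog, run from outside the monitoring stack, that checks Prometheus's own health endpoint; the URL and the idea of paging through a separate channel are assumptions in this sketch.

```python
# Sketch: an external watchdog that checks whether Prometheus itself is up.
# It should run (and alert) through a path that does not depend on Prometheus.
import sys
import urllib.request

def prometheus_is_healthy(url: str = "http://localhost:9090/-/healthy") -> bool:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    if not prometheus_is_healthy():
        print("prometheus health check failed", file=sys.stderr)
        sys.exit(1)  # hook this exit code into an independent paging channel
```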
batch job completions
allow 2 missed runs before alerting; if waiting that long is unacceptable, run the job more often until 2 missed runs is an acceptable delay (see Pushgateway sketch below)
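For batch jobs, a common Prometheus pattern is to push a last-success timestamp to the Pushgateway and alert when it goes stale; the job name, metric name, and gateway address below are illustrative.

```python
# Sketch: record a batch job's last successful completion via the Pushgateway,
# so an alert can fire when too much time passes without a success.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def report_success(job_name: str = "nightly_export",
                   gateway: str = "localhost:9091") -> None:
    registry = CollectorRegistry()
    last_success = Gauge(
        "job_last_success_timestamp_seconds",
        "Unix time of the last successful run of this batch job",
        registry=registry,
    )
    last_success.set_to_current_time()
    push_to_gateway(gateway, job=job_name, registry=registry)
```

For a daily job, the "two missed runs" rule above would translate to an alert expression along the lines of `time() - job_last_success_timestamp_seconds > 2 * 86400`.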