finchd
11/12/2018 - 5:41 AM

SeaGL 2018 - Monitoring and Alerting: Knowing the Unknown

Monitoring & Alerting

Amanda Sopkin (amsopkin at gmail.com)

#1 Alert on Symptoms, not causes

  • causes may not actually affect users
  • e.g. a full disk is a cause of the app being down, not a symptom; you do need to fix it, but because of the user impact, not because disks inherently need to be empty
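
A minimal sketch (not from the talk) of what alerting on the symptom rather than the cause looks like; the metric names and the 2% threshold are invented:

```python
# Hypothetical sketch: page on the symptom (users seeing errors),
# not the cause (a disk filling up). Names and threshold are invented.

def should_page(error_responses: int, total_responses: int) -> bool:
    """Page when users are actually affected by failing requests."""
    if total_responses == 0:
        return False
    # The symptom: more than 2% of user-facing requests are failing.
    # A full disk, high load, etc. are investigated as causes *after*
    # this fires; they are not paged on directly.
    return error_responses / total_responses > 0.02

print(should_page(error_responses=3, total_responses=1000))   # False: users barely affected
print(should_page(error_responses=50, total_responses=1000))  # True: 5% of requests failing
```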

#2 Keep system simple, and log lines easy to read

  • include only the minimum detail needed to be useful
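
A possible illustration (not from the talk): one event per line, carrying only the fields whoever gets paged actually needs; the logger name and fields are invented:

```python
# Keep log lines short, uniform, and greppable.
import logging

logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s %(message)s",
                    level=logging.INFO)
log = logging.getLogger("checkout")

# One line, one event, the identifiers needed to follow up -- nothing else.
log.info("payment declined order_id=%s provider=%s latency_ms=%d",
         "o-1234", "acme-pay", 742)
```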

#3 Consoles are :key:

  • good graphs need labels and other good practices
  • no more than 5 graphs per console (dashboard), no more than 5 plots/lines per graph (from monitoring w/ Prometheus)

#4 Make it easy to figure out which component is at fault
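
One way to make the faulty component obvious, sketched with the Python prometheus_client library; the metric, port, and component names are invented:

```python
# Tag every error with the component that produced it, so a single graph
# or alert points straight at the part of the system that is misbehaving.
from prometheus_client import Counter, start_http_server

ERRORS = Counter(
    "app_errors_total",
    "Errors observed, broken out by the component that produced them",
    ["component"],
)

start_http_server(8000)  # expose /metrics for scraping

ERRORS.labels(component="database").inc()
ERRORS.labels(component="payment-gateway").inc()
```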

#5 Create process to address & resolve alerts

  • escalation procedure
  • what is the appropriate chain of alerting

Discovering points of Failure

  • Console graphs? not particularly effective
  • Latency? somewhat effective
    • raw? hard to see average/impact
    • average? useless, doesn't move much
    • p99 instead, & p100
    • heatmap of latency
    • latency of failures vs. successes is important; erroring early is not better, it's still erroring (see the latency sketch after this list)
  • error rates
    • good for rare response codes
    • not particularly good for other patterns
    • unusual request sizes may also indicate errors
  • Traffic demand
    • RequestPerSecond
    • not useful for new issues
  • Saturation? i.e. how full the system is
    • users vs. CPU/mem/etc
  • evaluate trends in your alerts
  • outages may need more detailed evaluation of alerts
  • user metrics (google analytics/new relic/APM)
  • monitor the monitoring
  • batch job completions
    • allow 2 missed runs before alerting; if waiting that long is unacceptable, increase how often the job runs until 2 missed runs is an acceptable delay (sketch after this list)
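
A rough sketch with invented sample data of the latency points above: the average barely reflects a bad tail, p99/p100 expose it, and failures are worth timing separately from successes (failing slowly is still failing):

```python
import random

random.seed(0)
success_ms = [random.uniform(20, 60) for _ in range(980)]    # healthy requests
failure_ms = [random.uniform(900, 1500) for _ in range(20)]  # slow, erroring requests
all_ms = success_ms + failure_ms

def percentile(values, p):
    """Nearest-rank percentile over a small in-memory sample."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[index]

print(f"mean : {sum(all_ms) / len(all_ms):7.1f} ms")   # barely moves despite the bad tail
print(f"p99  : {percentile(all_ms, 99):7.1f} ms")      # over a second: the tail is visible
print(f"p100 : {percentile(all_ms, 100):7.1f} ms")     # the single worst request
print(f"p99 of successes: {percentile(success_ms, 99):7.1f} ms")
print(f"p99 of failures : {percentile(failure_ms, 99):7.1f} ms")
```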
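
A minimal sketch of the "allow 2 missed runs before alerting" rule for batch jobs; the hourly schedule and the names are invented for illustration:

```python
import time
from typing import Optional

RUN_INTERVAL_SECONDS = 60 * 60   # the batch job is expected to succeed hourly
ALLOWED_MISSES = 2               # tolerate two missed runs before paging

def batch_job_is_stale(last_success_epoch: float,
                       now: Optional[float] = None) -> bool:
    """True once more than ALLOWED_MISSES expected runs have passed without a success."""
    now = time.time() if now is None else now
    return (now - last_success_epoch) > ALLOWED_MISSES * RUN_INTERVAL_SECONDS

# If waiting two hours for this job is too long, run the job more often
# rather than lowering the threshold below two misses.
print(batch_job_is_stale(last_success_epoch=time.time() - 3 * 60 * 60))  # True: 3 hours stale
```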