szaydel
7/21/2017 - 9:27 PM

[RackTop Quick Presentations] #tags: presentations, slides, markdown

<!-- $theme: gaia -->
<!-- $size: 16:9 -->
<!-- page_number: true -->

# Important Metrics for All Disks
* Number of IOs per interval; IOPS (IOs per second) is the most common form
* Throughput
* IO size, which directly limits throughput
* Latency
* Random or Sequential IO?
* IO direction bias (reads vs. writes)
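
A rough sketch of why IO size limits throughput, assuming the simple relation throughput = IOPS × IO size. The function name and the sample numbers are hypothetical, not measurements:

```python
def throughput_bytes_per_sec(iops: float, io_size_bytes: int) -> float:
    """Throughput implied by an IO rate and a fixed IO size."""
    return iops * io_size_bytes


MIB = 1024 * 1024

# Same hypothetical IO rate, two different IO sizes.
small_io = throughput_bytes_per_sec(200, 4 * 1024)  # 4 KiB IOs
large_io = throughput_bytes_per_sec(200, MIB)       # 1 MiB IOs

print(f"4 KiB IOs: {small_io / MIB:.2f} MiB/s")
print(f"1 MiB IOs: {large_io / MIB:.2f} MiB/s")
```

At the same IOPS, the drive doing 1 MiB IOs moves 256 times more data than the one doing 4 KiB IOs.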

---

# How to think about these numbers
* We want to describe the reasoning behind these numbers and what they can tell an average consumer
* There are two obvious areas: health and performance

---

## Health
* Is the number of IOs substantially different across like drives?
* Is latency vastly different between two like drives?
* Is the number of bytes transferred similar across like drives?
* Do any drives show much more extreme observations? Define Extreme...
* How much active time? Define Active...
* What about IO errors?
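
One possible way to compare like drives, sketched under loud assumptions: "extreme" is taken to mean "more than `factor` times the group median", and both the threshold and the sample latencies are hypothetical. The right definition of extreme is still open:

```python
from statistics import median


def flag_outlier_drives(latency_ms_by_drive: dict, factor: float = 3.0) -> list:
    """Return drives whose latency exceeds `factor` times the group median.

    `factor` is an assumed threshold, not a recommended value.
    """
    group_median = median(latency_ms_by_drive.values())
    return sorted(
        drive
        for drive, latency in latency_ms_by_drive.items()
        if latency > factor * group_median
    )


# Hypothetical average latencies (ms) for four like drives.
sample = {"disk0": 4.1, "disk1": 3.8, "disk2": 4.4, "disk3": 19.5}
print(flag_outlier_drives(sample))
```

The median is used as the baseline so that the suspect drive does not drag the reference value toward itself.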

---

## Performance
* Are my IOs large or small, and why does it matter?
* Can I satisfy a throughput requirement of X?
* Do I experience high latency? What is high, anyway?
* Pending IOs indicate how busy devices are

---

# Summarizing Data
* Expected Values, Medians, Sums, Mins, Maxs, Buckets
* Limit loss of insights
* Averages tend to obscure structure

---

## Expected Values, Medians, Sums, Mins, Maxs, Buckets
* Most measurements benefit from reporting a mean together with the extreme low and high observations, or the range as MAX - MIN
* The median is not biased by extreme values such as latency spikes or IO stalls, but is expensive to compute
* Percentiles are useful for presenting ranges and tendency of data
* Percentile calculation in DTrace is expensive and limited by the lack of real-number support, but is easy in Influx
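
A minimal sketch of mean, median and percentiles on the same data, assuming the nearest-rank percentile definition (one of several in use). It needs a full sort, which is why it is expensive, but it only ever returns observed values, so integer data stays integer. The latencies are made up:

```python
import math
from statistics import mean, median


def percentile(values, p: int):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)  # the sort is the expensive part
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms, with a single stall.
latencies_ms = [2, 3, 3, 4, 4, 5, 5, 6, 7, 250]

print("mean  ", mean(latencies_ms))
print("median", median(latencies_ms))
print("p90   ", percentile(latencies_ms, 90))
print("p99   ", percentile(latencies_ms, 99))
```

The single 250 ms stall pulls the mean well above what nine out of ten IOs experienced, while the median and p90 stay close to typical values and p99 exposes the stall.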

---

## Limit loss of insights
* Avoid reporting ONLY averages without a range
* Histograms distill data into categorical or numerical ranges
* Categorical ranges like *high*, *normal*, *low* can be useful for latency grouping
* Numerical ranges allow for a more precise summary of a specific metric than categorical ones; meaningful for IOPS, latency, IO size, etc.
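
A sketch of both kinds of bucketing on the same data. The power-of-two buckets are one common choice of numerical range; the *low*/*normal*/*high* cutoffs and the sample latencies are hypothetical and would need tuning per device type:

```python
from collections import Counter


def power_of_two_bucket(value: int) -> int:
    """Lower bound of the power-of-two bucket that contains `value`."""
    if value < 1:
        return 0
    return 1 << (value.bit_length() - 1)


def categorize(latency_us: int, low_below: int = 500, high_from: int = 10_000) -> str:
    """Map a latency to a category; both cutoffs are assumed values."""
    if latency_us < low_below:
        return "low"
    if latency_us >= high_from:
        return "high"
    return "normal"


# Hypothetical latencies in microseconds.
latencies_us = [120, 300, 450, 800, 900, 1500, 2500, 12000, 40000]

numeric = Counter(power_of_two_bucket(v) for v in latencies_us)
categorical = Counter(categorize(v) for v in latencies_us)

for lower_bound, count in sorted(numeric.items()):
    print(f"{lower_bound:>6} us and up: {'#' * count}")
print(dict(categorical))
```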

---

## Averages tend to obscure structure
* Average = sum of all observations / N is influenced by extreme values, but hides their true significance
* Two drives may do vastly different amounts of IO, but have the same average latency
* Average may suggest a much lower or higher expected value than reality when N is small and outliers exist, e.g. **avg([1, 1000, 30, 30000]) = 7757.75**
* An average gives no sense of how common or exceptional large or small values are
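
A short sketch using the four-value example from this slide, reporting the average next to the median and range. The `summarize` helper is a hypothetical name:

```python
from statistics import mean, median


def summarize(values) -> dict:
    """Report the average alongside the numbers it tends to hide."""
    return {
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


summary = summarize([1, 1000, 30, 30000])
for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

The mean of 7757.75 is not close to any of the four observations; the median of 515 and the range of 29999 make that visible.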