— working note 03 / mmxxvi · time-series —

Time-Series
Anomaly Detection

Amanuel Bogale, B.S. Computer Science  ·  Kohki Nishio, B.S. Electronics & Information Science  ·  Zikai Zhu, M.S. Data Science

This paper surveys methodologies for unsupervised anomaly detection on DevOps metrics, comparing them through benchmarks across statistical, neural, and clustering categories. Methods are also organized along a second axis, rate, error, and duration (RED), which we found mattered more than the family axis when predicting which detector would win on a given metric. We draw on Oracle's internal Grafana telemetry (CPU and disk utilization, memory pressure, network traffic), Yahoo WebScope data used with permission, and a handful of open-source CSV datasets, including COVID retention rates and car-accident rates. Most of the methodologies here are state of the art: either widely deployed already or trending in the past year of literature. The paper treats data as vectors in some sections and as distributions in others. Let's jump in.

I. Introduction

When working with infrastructure at the scale of Oracle Cloud Infrastructure (OCI), you deal with millions of metrics representing many different things. Some of those metrics dictate the health of the system itself; that is the territory we care about here. Detecting anomalous events lets us surface long-range outliers and understand operational issues, and the system alerts developers whenever events similar in nature to past anomalies occur, which speeds up troubleshooting.

Anomalies that pass through the detection system undetected, false negatives, can be detrimental to the infrastructure, capable of causing significant losses and impacting customers. Normal data points that are flagged as anomalies, false positives, are damaging in a different way: they are noise that wastes developer attention, and they push humans to rely on their own judgment about which alerts are real, at which point a genuine low-profile anomaly can be quietly ignored.
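To keep the two failure modes straight, here is a minimal sketch that tallies them over a labeled window. The function name, labels, and example series are ours, purely for illustration; nothing here is part of the production alarm system.

    def confusion_counts(truth, flagged):
        """truth, flagged: iterables of booleans, one per data point."""
        tp = fp = fn = tn = 0
        for is_anomaly, is_flagged in zip(truth, flagged):
            if is_anomaly and is_flagged:
                tp += 1  # caught a real anomaly
            elif not is_anomaly and is_flagged:
                fp += 1  # false positive: noise that wastes attention
            elif is_anomaly and not is_flagged:
                fn += 1  # false negative: slipped through, can hurt customers
            else:
                tn += 1
        return tp, fp, fn, tn

    truth   = [False, False, True, False, True]   # ground-truth anomalies
    flagged = [False, True,  True, False, False]  # what the detector said
    tp, fp, fn, tn = confusion_counts(truth, flagged)
    precision = tp / (tp + fp)  # fraction of alerts that were real: 0.5
    recall    = tp / (tp + fn)  # fraction of anomalies we caught: 0.5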

Before building an anomaly detection system we have to understand what an anomaly is. The term is often used ambiguously and is commonly conflated with outliers. An outlier is a data point that lies at an abnormal distance from other values in a random sample from a population; detection here is mostly about how each data point behaves relative to the others. According to Homin Lee, a Staff Data Scientist at Datadog, with outlier detection we compare metrics that should be behaving similarly to one another. A sudden, huge jump in a stock price, seen at a given granularity over a five-year view, is an outlier.
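As a concrete illustration of the distance-from-the-sample idea, here is a minimal sketch; the k-sigma cutoff is a common rule of thumb, not a claim about any particular detector, and the numbers in the example are invented.

    import statistics

    def zscore_outliers(sample, k=3.0):
        """Flag points more than k standard deviations from the sample mean."""
        mean = statistics.mean(sample)
        sd = statistics.stdev(sample)
        if sd == 0:
            return []
        return [x for x in sample if abs(x - mean) > k * sd]

    # A mostly flat series with one huge jump, as in the stock example:
    print(zscore_outliers([10, 11, 10, 12, 11, 10, 95], k=2.0))  # -> [95]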

An anomaly, on the other hand, is an unexpected change within a metric's current data patterns: an event that does not conform to its expected pattern, a deviation from business as usual. For anomaly detection we use the metric's own history to see whether it is diverging from where we think it should be. Examples include behaviors that follow seasonal trends, where you have a range of historical points to compare the current one against; this is quite distinct from the stock-jump case above.
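A minimal sketch of the history-based view follows; the same-weekday, same-hour bucketing and the k-sigma cutoff are illustrative choices on our part, not the method of any detector surveyed later.

    from collections import defaultdict
    import statistics

    def seasonal_anomaly(history, ts, value, k=3.0):
        """history: iterable of (datetime, value) pairs.
        Compare `value` against past readings from the same weekday and
        hour, i.e. the range of historical points mentioned above."""
        buckets = defaultdict(list)
        for past_ts, past_value in history:
            buckets[(past_ts.weekday(), past_ts.hour)].append(past_value)
        past = buckets[(ts.weekday(), ts.hour)]
        if len(past) < 2:
            return False  # not enough history to judge this slot
        mean = statistics.mean(past)
        sd = statistics.stdev(past)
        return sd > 0 and abs(value - mean) > k * sd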

I·a. Why thresholds aren't enough

Previously, our alarm system detected abnormalities through static thresholds: when the data exceeded the threshold, the person monitoring the metric was alerted. This approach works for smaller-scale operations, but we run multiple regions with millions of metrics. Some regions, like our Austin servers, need different thresholds than those in Ashburn, Virginia, and the right threshold also varies by time of day, day of the week, and season. Because of all of these factors, last year we moved to an anomaly-detection system, but that introduced a separate set of issues.
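To make the contrast concrete, here is a sketch of a static check next to one keyed by region and hour; the region names, limits, and hours are invented for illustration.

    STATIC_LIMIT = 80.0  # one number for every metric, e.g. percent CPU

    def static_alert(value):
        return value > STATIC_LIMIT

    # Hypothetical per-(region, hour) limits; a real system would learn
    # these from history rather than hard-code them.
    LIMITS = {
        ("austin", 14): 90.0,   # afternoon batch jobs run hot here
        ("ashburn", 14): 75.0,
    }

    def contextual_alert(region, hour, value, default=STATIC_LIMIT):
        return value > LIMITS.get((region, hour), default)

Even this toy version shows the maintenance problem: every new region, hour, and season multiplies the table of limits, which is part of what pushed us toward learning the expected range instead.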

In last year's deployment there were numerous reported false positives, to the point that the accuracy of the alerts became a hindrance in itself. There were also reported lags when computing anomaly scores, which sometimes caused the system to crash.

amanuel.bogale@oracle.com · kohki.nishio@oracle.com · zikai.zhu@oracle.com