
Best way to statistically detect anomalies in data

Our webapp collects a huge amount of data about user actions, network activity, database load, and so on.

All data is stored in warehouses and we have quite a lot of interesting views on this data.

If something odd happens, chances are it shows up somewhere in the data.

However, to manually detect whether something out of the ordinary is going on, one has to continually look through this data for oddities.

My question: what is the best way to detect changes in dynamic data that can be seen as 'out of the ordinary'?

Are Bayesian filters (I've seen these mentioned when reading about spam detection) the way to go?

Any pointers would be great!

EDIT: To clarify, the data shows, for example, a daily curve of database load. This curve typically looks similar to the curve from yesterday. Over time this curve might change slowly.

It would be nice if a warning could go off when the day-to-day change in the curve falls outside some parameters.

R

Toad asked Aug 20 '09 15:08




2 Answers

Take a look at Control Charts, they provide a way to track changes in your data visually and specify when the data is "out of control" or "anomalous". They are heavily used in manufacturing to ensure quality control.
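As a rough illustration of the idea, here is a minimal Python sketch of a control chart check with 3-sigma limits. The load figures are made up, the function names are hypothetical, and the sample standard deviation is used as the spread estimate for simplicity (classic individuals charts often estimate it from moving ranges instead).

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) limits computed from baseline data
    that is assumed to represent normal ("in control") behaviour."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation (a simplification)
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(values, limits):
    """Return (index, value) pairs for readings outside the limits."""
    lower, _, upper = limits
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical database load (percent) at the same hour on previous days.
baseline = [41, 39, 42, 40, 38, 43, 41, 40, 39, 42]
limits = control_limits(baseline)

# Hypothetical new readings for that hour; the fourth one is a spike.
new_readings = [40, 44, 41, 58, 39]
alerts = out_of_control(new_readings, limits)
print(alerts)  # [(3, 58)]
```

For a daily curve, one could keep a separate baseline per hour of the day and recompute the limits over a sliding window, so that slow drift moves the limits while sudden jumps still trigger a warning.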

Carlos Rendon answered Sep 28 '22 00:09


This question is impossible to answer without knowing much more about the particular data you have. For an overview of what kinds of approaches exist, see Anomaly Detection: A Survey by Chandola, Banerjee, and Kumar.

Jouni K. Seppänen answered Sep 28 '22 02:09