Recommended anomaly detection technique for simple, one-dimensional scenario?

Tags:

classification

I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.

For example, with the following example data:

    a = 10
    b = 14
    c = 25
    d = 467
    e = 12

d is clearly an anomaly, and I would want to perform a specific action based on this.

I was tempted to just try and use my knowledge of the particular domain to detect anomalies: for instance, work out heuristically a distance from the mean value that is useful, and check for that. However, I think it's probably better if I investigate more general, robust anomaly detection techniques, which have some theory behind them.

Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.

Edit: thought I'd add more information about the data and what I've tried in case it makes one answer more correct than another.

The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on an intuition of the domain rather than on analysis; if this is a bad thing to assume, please let me know. In terms of clustering, unless there are also standard algorithms for choosing a k-value, I would find it hard to provide this value to a k-means algorithm.

The action I want to take for an outlier/anomaly is to present it to the user, and recommend that the data point is basically removed from the data set (I won't get into how they would do that, but it makes sense for my domain), so that it will not be used as input to another function.

So far I have tried the three-sigma rule and the IQR outlier test on my limited data set. IQR flags values which are not extreme enough, whereas three-sigma points out instances which better fit my intuition of the domain.

Information on algorithms, techniques or links to resources to learn about this specific scenario are valid and welcome answers.

What is a recommended anomaly detection technique for simple, one-dimensional data?


asked Feb 20 '10 20:02

Grundlefleck

2 Answers

Check out the three-sigma rule:

    mu  = mean of the data
    std = standard deviation of the data
    IF abs(x-mu) > 3*std  THEN  x is outlier
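As a rough illustration, the rule above could be written in Python as follows. This is only a sketch: the function name and the sample readings are made up for the example, and using the population standard deviation (`statistics.pstdev`) rather than the sample one is an assumption.

```python
import statistics

def three_sigma_outliers(values):
    """Return the values lying more than three standard deviations from the mean."""
    mu = statistics.mean(values)
    # Assumption: population standard deviation; statistics.stdev is the sample version.
    std = statistics.pstdev(values)
    return [x for x in values if abs(x - mu) > 3 * std]

# Illustrative data only: twenty ordinary readings plus one extreme value.
readings = [10, 12, 14, 16, 18, 20, 22, 24, 11, 13,
            15, 17, 19, 21, 23, 25, 12, 14, 16, 18, 467]
outliers = three_sigma_outliers(readings)
```

Note that on the five values from the question (10, 14, 25, 467, 12) the same function flags nothing: the extreme value inflates both the mean and the standard deviation, and with so few points no value can sit three deviations away from the mean. The rule needs a reasonably large sample to be useful.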

An alternative method is the IQR outlier test:

    Q25 = 25th_percentile
    Q75 = 75th_percentile
    IQR = Q75 - Q25         // inter-quartile range
    IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN  x is a mild outlier
    IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN  x is an extreme outlier
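The IQR test can be sketched in Python in much the same way. The function name is hypothetical, and the choice of the "inclusive" quartile method in `statistics.quantiles` is an assumption; other percentile conventions shift the fences slightly. Unlike the pseudocode, this version reports each value under one label only, so an extreme outlier is not also listed as mild.

```python
import statistics

def iqr_outliers(values):
    """Split outliers into (mild, extreme) using the 1.5*IQR and 3.0*IQR fences."""
    # Assumption: "inclusive" quartiles; other conventions give slightly different fences.
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q75 - q25  # inter-quartile range
    mild, extreme = [], []
    for x in values:
        if x < q25 - 3.0 * iqr or x > q75 + 3.0 * iqr:
            extreme.append(x)
        elif x < q25 - 1.5 * iqr or x > q75 + 1.5 * iqr:
            mild.append(x)
    return mild, extreme

# The five example values from the question.
mild, extreme = iqr_outliers([10, 14, 25, 467, 12])
```

Because the quartiles are barely affected by a single extreme value, this test still flags 467 on the five-point example, where three-sigma does not.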

This test is usually employed by box plots (indicated by the whiskers):

[Figure: box plot]

EDIT:

For your case (simple 1D univariate data), I think my first answer is well suited. It isn't applicable to multivariate data, however.

@smaclell suggested using K-means to find the outliers. Besides the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with K-means is that it requires knowing in advance a good value for the number of clusters K.

A better-suited technique is DBSCAN, a density-based clustering algorithm. Basically, it grows regions with sufficiently high density into clusters, each of which is a maximal set of density-connected points.

[Figure: DBSCAN clustering]

DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.

If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.

If the number of neighbors is less than minPoints, the point is marked as noise.

If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.

Finally, all points marked as noise are considered outliers.
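The steps above can be sketched for one-dimensional data in plain Python. This is a simplified illustration rather than a reference implementation: the function names are hypothetical, the expansion uses a work list instead of recursion, a point's neighborhood is assumed to include the point itself, and a point first marked as noise is relabeled if it later turns out to lie within epsilon of a dense point. The values of `eps` and `min_points` in the example are arbitrary.

```python
NOISE = -1

def dbscan_1d(values, eps, min_points):
    """Label each value with a cluster id, or NOISE if it belongs to no cluster."""
    labels = [None] * len(values)  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of values[i] (including i itself).
        return [j for j, v in enumerate(values) if abs(v - values[i]) <= eps]

    cluster = 0
    for i in range(len(values)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        pending = list(seeds)
        while pending:
            j = pending.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # reachable from a dense point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_points:
                pending.extend(reachable)  # j is dense too, keep growing
        cluster += 1
    return labels

def dbscan_outliers(values, eps, min_points):
    """Return the values that DBSCAN leaves marked as noise."""
    labels = dbscan_1d(values, eps, min_points)
    return [v for v, label in zip(values, labels) if label == NOISE]

# The five example values from the question; eps and min_points are arbitrary.
noise = dbscan_outliers([10, 14, 25, 467, 12], eps=15, min_points=2)
```

The neighbor search here is a linear scan, so the whole run is quadratic in the number of points; for several thousand integers that is usually acceptable, and sorting the values first would allow a faster search.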


answered Oct 07 '22 11:10

Amro

There are a variety of clustering techniques you could use to try to identify central tendencies within your data. One such algorithm we used heavily in my pattern recognition course was K-Means. This would allow you to identify whether there is more than one related set of data, such as in a bimodal distribution. It does require you to have some knowledge of how many clusters to expect, but it is fairly efficient and easy to implement.

After you have the means you could then try to find out if any point is far from any of the means. You can define 'far' however you want but I would recommend the suggestions by @Amro as a good starting point.
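A rough Python sketch of that idea for one-dimensional data is shown below. Everything in it is an assumption made for illustration: the function names, the deterministic choice of starting centers, the bimodal sample data, and the fixed `max_distance` used as the definition of 'far'.

```python
def kmeans_1d(values, k, iterations=100):
    """Plain Lloyd's algorithm on scalars; returns the k cluster means."""
    ordered = sorted(values)
    # Assumption: start from evenly spaced picks in the sorted data (deterministic).
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # An empty group keeps its previous center.
        updated = [sum(g) / len(g) if g else centers[c] for c, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return centers

def far_from_centers(values, centers, max_distance):
    """Return values whose nearest center is more than max_distance away."""
    return [v for v in values if min(abs(v - c) for c in centers) > max_distance]

# Illustrative bimodal data with one extreme value.
data = [10, 11, 12, 13, 14, 15, 100, 101, 102, 103, 104, 105, 467]
centers = kmeans_1d(data, k=2)
far = far_from_centers(data, centers, max_distance=100)
```

Note that the extreme value drags the mean of its cluster towards itself, so the threshold has to be generous enough not to flag the ordinary points in that cluster.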

For a more in-depth discussion of clustering algorithms, refer to the Wikipedia entry on clustering.

answered Oct 07 '22 09:10

smaclell
