
What is the range of Scikit-Learn's IsolationForest decision_function scores?

Scikit-Learn's IsolationForest class has a method decision_function that returns the anomaly scores of the input samples. However, the documentation does not state what the possible range of these scores is, and only states that "the lower [the score], the more abnormal."

Edit: after reading jmunsch's comment I looked at the source code again and here is my updated guess: If the exponent in the scores formula is always negative, then scores will always be between 0 and 1, which would mean the returned range is [-0.5, 0.5] since 0.5 - scores is returned by the method. But I'm not certain if the exponent would always be negative.

DataMan asked Jul 20 '17 19:07



1 Answer

In Scikit-Learn's IsolationForest, decision_function returns values in the range [-0.5, 0.5], where -0.5 is the most anomalous.

Or so I believe; I have never seen evidence otherwise. The documentation for Scikit-Learn's IsolationForest references the paper Isolation-based Anomaly Detection by Liu et al., where equation 2 defines the anomaly score. In the paper the anomaly score ranges between 0 and 1, where 1 is most anomalous. In the scores function you reference, on line 267 the variable depths.mean(axis=1) corresponds to E(h(x)) and _average_path_length(self.max_samples_) corresponds to c(psi) in the paper. Because the average depth is never negative and c(psi) is positive, the exponent is never positive, so the score stays between 0 and 1. Thus on line 272, when the function returns 0.5 minus the score, we get the bounds of [-0.5, 0.5].
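To make the bounds concrete, here is a minimal sketch of the paper's score and the shifted value in plain Python. It is not Scikit-Learn's code: the function names average_path_length, paper_score and shifted_score are hypothetical, the harmonic number is approximated with ln(n - 1) plus the Euler-Mascheroni constant as in the paper, and psi is assumed to be at least 2.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def average_path_length(n):
    """c(n) from the paper: average path length of an unsuccessful BST search."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def paper_score(mean_depth, psi):
    """s(x, psi) = 2 ** (-E(h(x)) / c(psi)); assumes psi >= 2."""
    return 2.0 ** (-mean_depth / average_path_length(psi))


def shifted_score(mean_depth, psi):
    """0.5 minus the paper's score, the quantity discussed in the answer."""
    return 0.5 - paper_score(mean_depth, psi)


psi = 256  # hypothetical sub-sample size
for mean_depth in (0.0, 1.0, average_path_length(psi), 50.0, 1000.0):
    print(mean_depth, paper_score(mean_depth, psi), shifted_score(mean_depth, psi))
```

A mean depth of 0 gives the most anomalous value, a mean depth equal to c(psi) lands in the middle of the range, and very deep points approach the upper bound without exceeding it.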

Edit/Bonus: The predict method of isolation forest is effectively just comparing the decision_function values to a threshold that is stored in model.threshold_. So after calling the model's predict method on some data, the anomalous items are the same items that meet the criterion: model.decision_function(data) < model.threshold_.
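The relationship between predict and decision_function can be checked directly. The sketch below assumes a recent scikit-learn release, where, as far as I know, threshold_ is no longer exposed and predict flags the samples whose decision_function value is below 0; the synthetic data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration): a Gaussian blob plus a few scattered points.
rng = np.random.RandomState(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.uniform(low=-8.0, high=8.0, size=(10, 2))
data = np.vstack([inliers, outliers])

# Default settings; in recent releases this means contamination="auto".
model = IsolationForest(random_state=0).fit(data)

scores = model.decision_function(data)
labels = model.predict(data)  # -1 for anomalies, 1 for normal points

# Assumption: the threshold is 0 in recent releases instead of model.threshold_.
flagged_by_threshold = scores < 0
flagged_by_predict = labels == -1

print(np.array_equal(flagged_by_threshold, flagged_by_predict))
print(scores.min(), scores.max())
```

On an older release that still has threshold_, comparing against model.threshold_ instead of 0 should give the same check.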

Alex answered Nov 15 '22 00:11