Scikit-Learn's IsolationForest class has a decision_function method
that returns the anomaly scores of the input samples. However, the documentation does not state the possible range of these scores, saying only that "the lower [the score], the more abnormal."
Edit: after reading jmunsch's comment I looked at the source code again, and here is my updated guess:
If the exponent in the score formula is always negative, then the scores will always be between 0 and 1. Since the method returns 0.5 - scores, the returned range would be [-0.5, 0.5]. But I'm not certain that the exponent is always negative.
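The exponent question can be settled directly from equations 1 and 2 of the Isolation Forest paper. A small sketch (the helper names `c` and `anomaly_score` are my own, not scikit-learn's): the mean path depth E(h(x)) is non-negative and the normalizer c(n) is positive, so the exponent -E(h(x))/c(n) is never positive, pinning the score inside (0, 1].

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search among n samples,
    c(n) = 2*H(n-1) - 2*(n-1)/n, per the Isolation Forest paper."""
    harmonic = math.log(n - 1) + EULER_GAMMA  # H(i) ~ ln(i) + Euler-Mascheroni
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_depth, n):
    # mean_depth = E(h(x)) >= 0 and c(n) > 0, so the exponent -mean_depth/c(n)
    # is always <= 0, which keeps the score 2**exponent inside (0, 1].
    return 2.0 ** (-mean_depth / c(n))

# A decision_function-style value 0.5 - score is therefore bounded in [-0.5, 0.5)
for depth in (0.0, 1.0, 5.0, 50.0):
    s = anomaly_score(depth, 256)
    print(depth, round(0.5 - s, 4))
```

A depth of 0 gives the maximum score of 1 (most anomalous), and very large depths push the score toward 0, so 0.5 - score never escapes [-0.5, 0.5).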
This score is an aggregation of the depths obtained from each of the iTrees. A label of -1 is assigned to anomalies and 1 to normal points, based on the contamination parameter (the expected percentage of anomalies in the data).
max_samples is the number of random samples drawn from the original data set to build each Isolation Tree. During the test phase, sklearn's IsolationForest computes the path length of the data point under test in every trained Isolation Tree and takes the average path length.
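To illustrate max_samples and the test-phase averaging, here is a small sketch (the dataset is synthetic and the parameter values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))                       # mostly "normal" points
X_out = rng.uniform(low=4, high=6, size=(10, 2))    # a few clear outliers

# max_samples: how many points each iTree is built from
clf = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
clf.fit(np.vstack([X, X_out]))

# decision_function aggregates the average path length across all 100 trees
scores = clf.decision_function(X_out)
print(scores)
```

The outliers sit far from the training bulk, so their averaged path lengths are short and their scores are lower (more negative) than those of the normal points.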
An anomaly score is created from an anomaly resource (anomaly/id) and the new instance (input_data) for which you wish to create an anomaly score. When you create a new anomaly score, BigML.io automatically computes a score between 0 and 1; the closer the score is to 1, the more anomalous the scored instance is.
The Isolation Forest algorithm is a fast tree-based algorithm for anomaly detection. It uses the concept of path lengths in binary search trees to assign anomaly scores to each point in a dataset.
In Scikit-Learn's IsolationForest, decision_function returns values in the range [-0.5, 0.5], where -0.5 is the most anomalous.
Or so I believe; I have never seen evidence otherwise. The documentation for Scikit-Learn's IsolationForest references the paper Isolation-based Anomaly Detection by Liu et al., where equation 2 defines the anomaly score. In the paper the anomaly score ranges between 0 and 1, where 1 is most anomalous. In the scores function you reference, on line 267 the expression depths.mean(axis=1) corresponds to E(h(x)) and _average_path_length(self.max_samples_) corresponds to c(psi) in the paper. Thus on line 272, when the function returns 0.5 minus the score, we get the bounds [-0.5, 0.5].
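The bound is easy to check empirically, assuming a reasonably recent scikit-learn (internal line numbers differ across versions, but the returned range does not):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 3))

clf = IsolationForest(random_state=42).fit(X)
d = clf.decision_function(X)

# 0.5 - s with s in (0, 1] stays inside [-0.5, 0.5)
print(d.min(), d.max())
```

On any dataset I have tried, the minimum and maximum stay inside [-0.5, 0.5].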
Edit/Bonus:
The predict method of IsolationForest is effectively just comparing the decision_function values to a threshold stored in model.threshold_. So after calling the model's predict method on some data, the anomalous items are exactly those that meet the criterion model.decision_function(data) < model.threshold_.
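Note that model.threshold_ was removed in later scikit-learn releases; with the default contamination="auto" the offset is folded into decision_function itself, so the modern equivalent of the same criterion is decision_function(data) < 0. A quick sketch:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(7)
X = rng.normal(size=(300, 2))

clf = IsolationForest(random_state=7).fit(X)
pred = clf.predict(X)                   # -1 = anomaly, 1 = normal
flagged = clf.decision_function(X) < 0  # the same criterion, stated directly

print(int(flagged.sum()), "points flagged as anomalous")
```

The two views agree point for point: predict returns -1 exactly where decision_function is negative.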