I am trying to detect the outliers in my dataset and I found sklearn's Isolation Forest. I can't understand how to work with it. I fit my training data to it and it gives me back a vector of -1 and 1 values.
Can anyone explain to me how it works and provide an example?
How can I know that the outliers are 'real' outliers?
Tuning Parameters?
Here is my code:
from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples=10000, random_state=10)
clf.fit(x_train)
y_pred_train = clf.predict(x_train)  # e.g. [1 1 1 ..., -1 1 1]
y_pred_test = clf.predict(x_test)
In an Isolation Forest, randomly sub-sampled data is processed in a tree structure based on randomly selected features. The samples that travel deeper into the tree are less likely to be anomalies as they required more cuts to isolate them.
It seems you have many questions; let me try to answer them one by one to the best of my knowledge.
How does it work?
It works because outliers in any data set are, by their nature, "few and different", which is quite different from the assumption behind typical clustering-based or distance-based algorithms. At the top level, it works on the logic that outliers take fewer random splits to 'isolate' compared to 'normal' points in the data set. To do so, this is what IF does: suppose you have a training data set X with n data points, each having m features. During training, IF builds Isolation Trees (binary trees) by splitting on randomly selected features at randomly selected split values.
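As a minimal sketch of the idea in code (the data here is synthetic, made up purely for illustration): a point far from the main cloud is isolated in very few splits, so the forest flags it with -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(10)
# 100 'normal' points around the origin, plus one obvious outlier
x_train = np.concatenate([rng.normal(0, 1, size=(100, 2)),
                          [[8.0, 8.0]]])

clf = IsolationForest(n_estimators=100, random_state=10)
clf.fit(x_train)

labels = clf.predict(x_train)  # 1 = inlier, -1 = outlier
```

The injected point at (8, 8) sits far from the Gaussian cloud, so it takes very few random splits to isolate and ends up labelled -1.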
For training, you have 3 parameters for tuning during the train phase (all three are constructor arguments of sklearn's IsolationForest):

n_estimators: the number of Isolation Trees built for the ensemble.
max_samples: the number of random samples it will pick from the original data set for creating each Isolation Tree.
max_features: the number of features drawn from the data set to train each tree.
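A quick sketch of setting these three parameters (the values here are hypothetical, chosen only to illustrate the constructor arguments):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# synthetic data, purely for illustration
X = np.random.RandomState(0).normal(size=(300, 4))

clf = IsolationForest(
    n_estimators=200,  # number of Isolation Trees in the forest
    max_samples=256,   # random samples drawn to build each tree
    max_features=2,    # features drawn to train each tree
    random_state=10,
)
clf.fit(X)
```

After fitting, the forest holds 200 trees, each built from a 256-sample subset of X.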
During the test phase:
sklearn's IsolationForest finds the path length of the data point under test in each of the trained Isolation Trees and takes the average path length. The longer the average path, the more 'normal' the point, and vice versa.
Based on the average path length, it calculates an anomaly score; the decision_function of sklearn's IsolationForest can be used to get this. The lower the score, the more anomalous the sample.
Based on the anomaly score, you can decide whether a given sample is anomalous or not by setting a proper value of contamination in the IsolationForest object: the expected proportion of outliers in the data set, which fixes the decision threshold. Its default value was 0.1 in older sklearn releases (newer releases default to 'auto'), and you can tune it to move the threshold.
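The test phase can be sketched like this (synthetic data again, made up for illustration): decision_function returns the anomaly score, and predict applies the contamination-based threshold to it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# 200 normal points plus one obvious outlier at (10, 10)
X = np.concatenate([rng.normal(0, 1, size=(200, 2)), [[10.0, 10.0]]])

clf = IsolationForest(contamination=0.01, random_state=0).fit(X)

scores = clf.decision_function(X)  # lower score = more anomalous
labels = clf.predict(X)            # -1 where the score falls below the threshold
```

The point at index 200 has the shortest average path length, hence the lowest score, and is labelled -1.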
Tuning parameters

Training -> n_estimators, max_samples, max_features.
Testing -> contamination.
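To see the effect of tuning contamination in isolation (synthetic data, for illustration only): with the trees held fixed by the same random_state, a higher contamination value moves the threshold so that more training points are labelled -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.normal(size=(500, 3))

# higher contamination => more points labelled as outliers
counts = []
for c in (0.01, 0.05, 0.10):
    clf = IsolationForest(contamination=c, random_state=42).fit(X)
    counts.append(int((clf.predict(X) == -1).sum()))
```

The counts grow roughly in proportion to the contamination value, since it is the expected fraction of outliers in the data.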