I am trying to use the Isolation Forest algorithm with Python's scikit-learn.
I do not understand why I have to generate the sets X_test and X_outliers because, when I get my data, I have no idea whether it contains outliers. But maybe this is just an example, and I do not have to generate and fill those sets in every case. I thought that isolation forest does not need to receive a clean X_train (with no outliers).
Did I misunderstand the algorithm? Do I have to use another algorithm (I thought about one-class SVM, but its X_train has to be as clean as possible)?
Is the isolation forest algorithm an unsupervised algorithm or a supervised one (like the random forest algorithm)?
The Isolation Forest algorithm is a fast tree-based algorithm for anomaly detection. The algorithm uses the concept of path lengths in binary search trees to assign anomaly scores to each point in a dataset.
The Isolation Forest returns a score bounded between 0 and 1, where values closer to 1 are considered anomalous and values below 0.5 are considered "normal".
Interpreting Anomaly Scores
If instances return s very close to 1, then they are definitely anomalies.
If instances have s smaller than 0.5, then they are quite safe to be regarded as normal instances.
If all the instances return s≈0.5, then the entire sample does not really have any distinct anomaly.
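The thresholds above use the 0 to 1 convention. scikit-learn's score_samples returns the opposite of that score (negative values, lower meaning more abnormal), so a sign flip recovers s. A minimal sketch, assuming made-up data: one tight cluster plus a single far-away point.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Hypothetical data: a tight cluster plus one far-away point (last row).
X = np.vstack([0.3 * rng.randn(200, 2), [[6.0, 6.0]]])

model = IsolationForest(n_estimators=100, random_state=0).fit(X)

# score_samples is the negated 0-to-1 score, so flip the sign to get s.
s = -model.score_samples(X)

print("s for the far-away point:", s[-1])
print("median s for the cluster:", np.median(s[:-1]))
```

The far-away point should come out well above 0.5 and the bulk of the cluster below it, which matches the interpretation rules above.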
Using Isolation Forest, we can not only detect anomalies faster but we also require less memory compared to other algorithms. Isolation Forest isolates anomalies in the data points instead of profiling normal data points.
"Is the isolation forest algorithm an unsupervised algorithm or a supervised one (like the random forest algorithm)?"
Isolation Forest is an unsupervised algorithm, so it does not need labels to identify outliers/anomalies. It works through the following steps:
The end of the tree is reached once the recursive partition of the data is finished. The path taken to reach an outlier is expected to be far shorter than the path to a normal point (see the figure).
The path length is averaged over the trees and normalised to calculate the anomaly score. A score close to 1 indicates an outlier; values close to 0 are considered normal.
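The averaging and normalisation step can be sketched directly. This assumes the formulation from the original Isolation Forest paper, s = 2^(-E[h(x)] / c(n)), where c(n) is the average path length of an unsuccessful binary search tree lookup over n points. The function names are illustrative, not part of any library.

```python
import math

EULER_GAMMA = 0.5772156649015329


def average_path_length(n):
    """c(n): normalising constant for a subsample of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s = 2 ** (-E[h(x)] / c(n)); values near 1 mean anomalous."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


n = 256  # subsample size used per tree (an assumed value)
print("very short path:", anomaly_score(1.0, n))
print("average path:   ", anomaly_score(average_path_length(n), n))
print("very long path: ", anomaly_score(30.0, n))
```

A point whose mean path length equals c(n) scores exactly 0.5, shorter paths push the score toward 1, and longer paths push it toward 0.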
The judgment of the outlier is carried out on the basis of the score. There is no need for a label column. Therefore it is an unsupervised algorithm.
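To make this concrete for the original question, here is a minimal sketch that fits scikit-learn's IsolationForest on a single unlabeled array, with no y and no separate clean training set. The data is made up for illustration. The X_test and X_outliers sets in the scikit-learn example appear to exist only so that the result can be checked and plotted; they are not something the algorithm requires.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Hypothetical unlabeled data: mostly one cluster, with a few stray points mixed in.
inliers = 0.5 * rng.randn(300, 2)
strays = rng.uniform(low=-6, high=6, size=(10, 2))
X = np.vstack([inliers, strays])

# fit_predict receives only X: there is no label column and no clean training set.
model = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
labels = model.fit_predict(X)  # +1 = inlier, -1 = outlier

print("points flagged as outliers:", int((labels == -1).sum()))
```

If you have a rough idea of the fraction of outliers in your data, you can pass it as contamination instead of "auto"; that only moves the threshold applied to the scores, not the scores themselves.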