Trying to understand isolation forest algorithm

Q: How do you interpret anomaly scores?

Interpreting anomaly scores: if an instance returns a score s very close to 1, it is definitely an anomaly. If an instance has s smaller than 0.5, it is quite safe to regard it as a normal instance. If all instances return s ≈ 0.5, the entire sample does not really have any distinct anomaly.
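As a rough illustration of where these thresholds come from, here is a minimal sketch of the score defined in the original isolation forest paper, s = 2^(-E(h(x)) / c(n)), where E(h(x)) is the mean path length of an instance over the trees and c(n) is the average path length of an unsuccessful binary search tree lookup over n points. The function names `c_factor` and `anomaly_score` are made up for this example, and the harmonic number is approximated as ln(i) + 0.5772156649.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def c_factor(n):
    """Normalising term c(n): average path length of an unsuccessful
    BST search over n points (harmonic number approximated by ln + gamma)."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """Score s in (0, 1]: short paths give s near 1, long paths give s near 0."""
    return 2.0 ** (-mean_path_length / c_factor(n))


n = 256  # subsample size
print(anomaly_score(c_factor(n), n))  # mean path equals c(n) -> 0.5
print(anomaly_score(2.0, n))          # isolated after ~2 splits: above 0.5
print(anomaly_score(20.0, n))         # needs ~20 splits: below 0.5
```

A mean path length equal to c(n) gives exactly 0.5, which is why a sample whose scores all sit near 0.5 has no instance that is noticeably easier to isolate than the rest.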

Q: What is the main application of an Isolation Forest?

Isolation Forest is used mainly for anomaly detection. With it, we can not only detect anomalies faster but also use less memory than with many other algorithms. Isolation Forest isolates the anomalous data points directly instead of profiling the normal data points.

Tags: python, algorithm, scikit-learn

I am trying to use the isolation forest algorithm with Python scikit-learn.

I do not understand why I have to generate the sets X_test and X_outliers, because when I get my data I have no idea whether it contains outliers or not. But maybe this is just an example and I do not have to generate and fill those sets in every case. I thought that isolation forest does not need to receive a clean X_train (one with no outliers).

Did I misunderstand the algorithm? Do I have to use another algorithm (I thought about one-class SVM, but its X_train has to be as clean as possible)?

Is the isolation forest algorithm an unsupervised algorithm or a supervised one (like the random forest algorithm)?

741

asked Dec 12 '16 22:12

Chènevis

1 Answer

"Is the isolation forest algorithm an unsupervised algorithm or a supervised one (like the random forest algorithm)?"

Isolation forest is an unsupervised algorithm, so it does not need labels to identify outliers/anomalies. It follows these steps:

1. The data is partitioned randomly and recursively, and each partitioning is represented as a tree (an isolation tree; the ensemble of such trees is the forest). This is the training stage, where the user defines the subsample size and the number of trees. The authors (Liu, Ting and Zhou, 2008) suggest default values of 256 for the subsample size and 100 for the number of trees. The average path length converges as the number of trees increases. However, fine-tuning may be required on a case-by-case basis.

2. The end of a tree is reached once the recursive partitioning of the data is finished. The path needed to reach (isolate) an outlier is expected to be far shorter than the path for normal data (see the figure).

3. The path length is averaged over the trees and normalised to calculate the anomaly score. A score close to 1 is considered an outlier; values close to 0 are considered normal.
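The idea behind these steps can be sketched in plain Python. This is a simplified illustration, not the reference implementation: it follows a single point down random splits on the fly instead of building and storing trees, it skips subsampling and the height-limit adjustment used in the paper, and the names `path_length` and `mean_path_length` are invented for this example.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed until `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))           # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                              # cannot split on this feature
        return depth
    split = rng.uniform(lo, hi)               # pick a random split value
    # Keep only the side of the split that contains the point.
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=100, seed=0):
    """Average the path length over many random partitionings."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Toy data: a 2-D Gaussian cluster, one point at its centre, one far away.
rng = random.Random(42)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
inlier = (0.0, 0.0)
outlier = (8.0, 8.0)
data = cluster + [inlier, outlier]

inlier_depth = mean_path_length(inlier, data)
outlier_depth = mean_path_length(outlier, data)
print("inlier: ", inlier_depth)
print("outlier:", outlier_depth)   # noticeably smaller than the inlier value
```

The far-away point tends to be cut off after only a few splits, while the point in the middle of the cluster needs many more, which is the difference the anomaly score is built on.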

Whether a point is an outlier is judged on the basis of the score alone. No label column is needed, so it is an unsupervised algorithm.
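For the scikit-learn case from the question, a minimal sketch could look like the following. The toy data and variable names are assumptions made for illustration; the point is only that `fit` receives a single unlabeled array, with no separate clean training set and no label column. Note that scikit-learn's `score_samples` uses the opposite sign convention to the score described above: lower values mean more abnormal.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Toy unlabeled data: one Gaussian cluster plus two far-away points.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 2)),
    np.array([[8.0, 8.0], [-9.0, 7.5]]),
])

# 100 trees and a subsample of 256 are the values suggested in the paper.
model = IsolationForest(
    n_estimators=100,
    max_samples=256,
    contamination="auto",
    random_state=0,
)
model.fit(X)                     # no labels are passed

labels = model.predict(X)        # +1 = inlier, -1 = outlier
scores = model.score_samples(X)  # lower = more abnormal
paper_style = -scores            # flipped sign: closer to 1 = more abnormal

print(labels[-2:])               # labels of the two far-away points
print(paper_style[-2:])
```

If the expected share of outliers is known, `contamination` can be set to that fraction instead of `"auto"`, which changes the threshold used by `predict` but not the scores themselves.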

103

answered Oct 16 '22 11:10

Amar nayak
