My question is about the novelty detection algorithms Isolation Forest and One-Class SVM. I have a training dataset (with 4-5 features) in which all the sample points are inliers, and I need to classify any new data as an inlier or an outlier and ingest it into another dataframe accordingly.
When using Isolation Forest or One-Class SVM, I have to supply the contamination fraction (nu for One-Class SVM) during the training phase. However, since the training dataset doesn't contain any contamination, do I need to add outliers to the training dataframe and use that outlier fraction as nu?
Also, while using Isolation Forest, I noticed that the outlier percentage changes every time I predict, even though I don't change the model. Is there a way to address this other than switching to the Extended Isolation Forest algorithm?
Thanks in advance.
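Regarding contamination for Isolation Forest:
If you are training on normal instances only (all inliers), you should set contamination to zero. If you don't specify it, contamination defaults to 0.1 (in scikit-learn version 0.20). Recent scikit-learn releases default to 'auto' instead and document the valid range for a float as (0, 0.5], so contamination=0 applies to the older API only.
The following simple code demonstrates this: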
1- Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
2- Generate a 2D dataset
X = 0.3 * rng.randn(1000, 2)
3- Train iForest model and predict the outliers
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
y_pred_train = clf.predict(X)
4- Print # of anomalies
print(sum(y_pred_train==-1))
This gives you 0 anomalies. If you now change contamination to 0.15, the program flags about 150 anomalies (15% of the 1000 points) in the same dataset (the same because of RandomState(42)).
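As a minimal sketch of how contamination controls the number of flagged points, the snippet below fits the same synthetic data with two contamination values and counts the training points labelled -1. It assumes a recent scikit-learn release, where a float contamination must lie in (0, 0.5], so 0.15 and 0.01 are used rather than 0. The helper name count_flagged and the integer seed are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same synthetic inlier-only data as in the steps above
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(1000, 2)


def count_flagged(contamination):
    """Fit on X and return how many training points are labelled -1."""
    # An integer seed (an assumption here) makes every fit reproducible
    clf = IsolationForest(random_state=0, contamination=contamination)
    clf.fit(X)
    return int(np.sum(clf.predict(X) == -1))


n_high = count_flagged(0.15)  # about 15% of the 1000 training points
n_low = count_flagged(0.01)   # about 1% of the 1000 training points
print(n_high, n_low)
```

The counts track the requested fraction because the model places its cut-off at the matching percentile of the training scores, regardless of whether those points are truly anomalous.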
"Training with normal data (inliers) only".
This goes against the nature of Isolation Forest. Training here is completely different from training a neural network. Because many people use these algorithms without clarifying what is going on, and write blogs with 20% of the ML knowledge required, we end up with questions like this.
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
What does fit do here? Is it training? If yes, what is trained?
In Isolation Forest, contamination determines your threshold. If it is 0, then what is your threshold?
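One way to explore these questions is to inspect a fitted estimator. The sketch below assumes a recent scikit-learn release and uses its documented attributes estimators_ and offset_ together with score_samples and decision_function. The data, the seed, the contamination value of 0.05, and the two test points are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(1000, 2)           # inliers only
X_new = np.array([[0.0, 0.0], [4.0, 4.0]])   # one typical point, one far away

clf = IsolationForest(random_state=0, contamination=0.05).fit(X_train)

# fit builds an ensemble of random isolation trees (100 by default)
n_trees = len(clf.estimators_)

# score_samples returns the raw anomaly score; lower means more anomalous
raw_scores = clf.score_samples(X_train)

# offset_ is the threshold: with a float contamination it is the matching
# percentile of the training scores (the 5th percentile here)
threshold = clf.offset_

# decision_function is the raw score shifted by the threshold;
# predict labels a point -1 when the shifted score is negative
shifted = clf.decision_function(X_train)
labels_new = clf.predict(X_new)

# With contamination='auto' the threshold is fixed at -0.5 instead
auto_threshold = IsolationForest(random_state=0).fit(X_train).offset_

print(n_trees, threshold, auto_threshold, labels_new)
```

In this sketch fit does two things: it grows the trees, and it derives a score threshold from contamination. That is why the choice of contamination changes the labels even when the trees themselves are identical.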
Please read the original paper first to understand the logic behind it. Not every anomaly detection algorithm suits every occasion.