
One Class SVM and Isolation Forest for novelty detection

My question is about the novelty detection algorithms Isolation Forest and One-Class SVM. I have a training dataset (with 4-5 features) in which all the sample points are inliers, and I need to classify any new data point as an inlier or an outlier and ingest it into another dataframe accordingly.

While trying to use Isolation Forest or One-Class SVM, I have to specify the contamination fraction (nu for the One-Class SVM) during the training phase. However, as the training dataset doesn't contain any contamination, do I need to add outliers to the training dataframe and set that outlier fraction as nu/contamination?
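
For context, this is roughly how I am calling the One-Class SVM (a minimal sketch only; the data and the nu value below are illustrative):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.randn(1000, 4)      # illustrative inlier-only training data (4 features)
X_new = rng.randn(10, 4)          # new samples to classify

# nu is an upper bound on the fraction of training errors and a lower bound on the
# fraction of support vectors -- this is the value I am unsure about when the
# training set has no outliers at all
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01)
ocsvm.fit(X_train)
labels = ocsvm.predict(X_new)     # +1 = inlier, -1 = outlier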

Also, while using Isolation Forest, I noticed that the outlier percentage changes every time I predict, even though I don't change the model. Is there a way to take care of this problem apart from moving to the Extended Isolation Forest algorithm?
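
A minimal sketch of the kind of variation I mean, assuming the forest is refit before each prediction (the data here is made up):

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.randn(1000, 4)      # made-up data, all inliers

# without a fixed random_state every fit builds a different set of trees,
# so the number of points flagged as -1 drifts from run to run
for run in range(2):
    clf = IsolationForest()
    clf.fit(X)
    print("run", run, "flagged:", (clf.predict(X) == -1).sum())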

Thanks in advance.

asked Oct 15 '19 by subhadeep sarkar


2 Answers

Regarding contamination for Isolation Forest:

If you are training on normal instances only (all inliers), you should set contamination to zero. If you don't specify it, contamination defaults to 0.1 (in scikit-learn 0.20; from 0.22 the default is 'auto'). Note that recent scikit-learn releases require contamination to be in (0, 0.5] or 'auto', so a strict zero may be rejected there.

The following is a simple piece of code to show this:

1- Import libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)

2- Generate a 2D dataset

X = 0.3 * rng.randn(1000, 2)

3- Train iForest model and predict the outliers

clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
y_pred_train = clf.predict(X)  

4- Print # of anomalies

print(sum(y_pred_train==-1))

This gives you 0 anomalies. Now if you change the contamination to 0.15, the model flags 150 anomalies out of the same dataset you already had (the same data, because of RandomState(42)).
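
For comparison, a quick check with a non-zero contamination (this just reuses X and rng from the snippet above):

clf = IsolationForest(random_state=rng, contamination=0.15)
clf.fit(X)
y_pred_train = clf.predict(X)
print(sum(y_pred_train == -1))    # roughly 15% (~150) of the 1000 points are labelled -1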

[References]:

[1] Liu, Fei Tony, Ting, Kai Ming, and Zhou, Zhi-Hua. "Isolation Forest." 2008 Eighth IEEE International Conference on Data Mining (ICDM '08), 2008.

[2] Liu, Fei Tony, Ting, Kai Ming, and Zhou, Zhi-Hua. "Isolation-Based Anomaly Detection." ACM Transactions on Knowledge Discovery from Data (TKDD), 2012.

answered Sep 21 '22 by M. Esmalifalak PhD


"Training with normal data(inliers) only".

This is against the nature of Isolation Forest. The training is here completely different than training in the Neural Networks. Because everyone is using these without clarifying what is going on, and writing blogs with 20% of ML knowledge, we are having questions like this.

clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)

What does fit do here? Is it training? If yes, what is trained?

In Isolation Forest:

  1. First, we build trees,
  2. Then, we pass each data point through each tree,
  3. Then, we calculate the average path length required to isolate the point.
  4. The shorter the path, the higher the anomaly score.

contamination determines your threshold. If it is 0, then what is your threshold?
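
A short check of this point (a sketch using scikit-learn's IsolationForest, whose offset_ attribute holds the fitted threshold):

import numpy as np
from sklearn.ensemble import IsolationForest

X = 0.3 * np.random.RandomState(42).randn(1000, 2)

clf = IsolationForest(contamination=0.05, random_state=42)
clf.fit(X)                             # builds the trees, then fixes the threshold

scores = clf.score_samples(X)          # higher = more normal, lower = more anomalous
print(clf.offset_)                     # threshold implied by contamination=0.05
print((scores < clf.offset_).sum())    # about 5% of the points fall below it
print((clf.predict(X) == -1).sum())    # predict() flags exactly those points as -1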

Please read the original paper first to understand the logic behind it. Not every anomaly detection algorithm suits every occasion.

answered Sep 23 '22 by Mr. Panda