One Class SVM and Isolation Forest for novelty detection

Question

My question is about the novelty detection algorithms Isolation Forest and One-Class SVM. I have a training dataset (with 4-5 features) in which all the sample points are inliers, and I need to classify any new data as an inlier or an outlier and ingest it into another dataframe accordingly.

When using Isolation Forest or One-Class SVM, I have to supply the contamination percentage (nu) during the training phase. However, since the training dataset doesn't have any contamination, do I need to add outliers to the training dataframe and use that outlier fraction as nu?

Also, while using Isolation Forest, I noticed that the outlier percentage changes every time I predict, even though I don't change the model. Is there a way to handle this other than switching to the Extended Isolation Forest algorithm?

Thanks in advance.

M. Esmalifalak PhD · Accepted Answer

Regarding contamination for Isolation Forest:

If you are training on normal instances only (all inliers), you should set contamination to zero. If you don't specify it, contamination defaults to 0.1 (in version 0.20; from version 0.22 the default is 'auto').

The following simple code shows this:

1- Import libraries

import numpy as np
from sklearn.ensemble import IsolationForest

# Fixed seed so the generated dataset is the same on every run
rng = np.random.RandomState(42)

2- Generate a 2D dataset

X = 0.3 * rng.randn(1000, 2)

3- Train iForest model and predict the outliers

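# contamination=0 declares that the training set contains no outliers
# (recent scikit-learn releases only accept 'auto' or a float in (0, 0.5])
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
# predict returns +1 for inliers and -1 for outliers
y_pred_train = clf.predict(X)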

4- Print # of anomalies

print(np.sum(y_pred_train == -1))

This gives you 0 anomalies. If you now change contamination to 0.15, the program flags 150 anomalies in the same dataset you already had (the same because of RandomState(42)).
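To make the link between contamination and the number of flagged points explicit, here is a minimal sketch, assuming a scikit-learn version that exposes score_samples and offset_ (0.20 or later). It uses contamination=0.15 and, as an assumption of this sketch, an integer random_state so that refitting gives the same forest each time.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Same synthetic 2D dataset as above
rng = np.random.RandomState(42)
X = 0.3 * rng.randn(1000, 2)

# An integer random_state makes the fitted forest reproducible across runs
clf = IsolationForest(random_state=0, contamination=0.15)
clf.fit(X)

# score_samples: the lower the score, the more anomalous the point
scores = clf.score_samples(X)

# offset_ is the score threshold derived from contamination during fit;
# predict labels a point -1 when its score falls below that threshold
n_flagged = int(np.sum(clf.predict(X) == -1))
n_below_offset = int(np.sum(scores < clf.offset_))

print("flagged by predict:", n_flagged)
print("scores below offset_:", n_below_offset)
```

Both counts should come out close to 15% of the 1000 training points, since contamination only moves the threshold on the scores and does not change the trees themselves.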

References:

[1] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation forest." Eighth IEEE International Conference on Data Mining (ICDM'08), 2008.

[2] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD), 2012.

Mr. Panda · Answer

"Training with normal data (inliers) only."

This goes against the nature of Isolation Forest. Training here is completely different from training a neural network. Because everyone uses these algorithms without clarifying what is going on, and writes blogs with 20% of the ML knowledge, we end up with questions like this.

clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)

What does fit do here? Is it training? If yes, what is trained?

In Isolation Forest:

First, we build the trees.
Then, we pass each data point through each tree.
Then, we calculate the average path length required to isolate the point.
The shorter the path, the higher the anomaly score.
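The steps above can be sketched in plain Python. This is a simplified illustration, not the scikit-learn implementation: the names path_length and average_path_length, the parameters n_trees and sample_size, and the depth cap are all assumptions of the sketch. Instead of building full trees, it only follows the random splits along the branch that contains the point of interest.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=20):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))           # pick a random feature
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                              # cannot split identical values
        return depth
    split = rng.uniform(lo, hi)               # pick a random split value
    # Keep only the side of the split that contains `point`
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)

def average_path_length(point, data, n_trees=100, sample_size=64, seed=0):
    """Average the path length over several random subsamples ("trees")."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(point, sample, rng)
    return total / n_trees

# Gaussian cluster around the origin, similar to the dataset in the accepted answer
data_rng = random.Random(42)
data = [(0.3 * data_rng.gauss(0, 1), 0.3 * data_rng.gauss(0, 1))
        for _ in range(1000)]

inlier = (0.0, 0.0)    # in the dense centre of the cluster
outlier = (3.0, 3.0)   # far away from every training point

print("inlier :", average_path_length(inlier, data))
print("outlier:", average_path_length(outlier, data))
```

The far-away point is separated from the rest after only a few splits, while the central point needs many more, which is why a shorter average path translates into a higher anomaly score.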

contamination determines your threshold. If it is 0, then what is your threshold?

Please read the original paper first to understand the logic behind it. Not every anomaly detection algorithm suits every occasion.
