 

Isolation Forest in Python

I am currently working on detecting outliers in my dataset using Isolation Forest in Python, and I did not completely understand the example and explanation given in the scikit-learn documentation.

Is it possible to use Isolation Forest to detect outliers in my dataset that has 258 rows and 10 columns?

Do I need a separate dataset to train the model? If yes, is it necessary to have that training dataset free from outliers?

This is my code:

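import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# test points generated the same way as in the scikit-learn example
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
clf = IsolationForest(max_samples=100, random_state=rng, contamination='auto')
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
print(len(y_pred_train))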

I tried loading my own dataset into X_train, but that does not seem to work.

Nnn asked Feb 18 '19 06:02



People also ask

What is the purpose of Isolation Forest?

In an Isolation Forest, randomly sub-sampled data is processed in a tree structure based on randomly selected features. The samples that travel deeper into the tree are less likely to be anomalies as they required more cuts to isolate them.
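The relationship between isolation depth and anomaly score can be illustrated with a minimal sketch. It assumes scikit-learn's IsolationForest and a synthetic dataset invented here for illustration (one tight cluster plus a single distant point); the variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Synthetic data (an assumption for this sketch): 200 points in one tight
# cluster, plus a single far-away point stored at index 200.
cluster = 0.3 * rng.randn(200, 2)
far_point = np.array([[6.0, 6.0]])
X_demo = np.vstack([cluster, far_point])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_demo)

# score_samples is lower for points that are isolated after fewer random cuts,
# so the smallest score marks the point that is easiest to isolate.
scores = forest.score_samples(X_demo)
most_anomalous = int(np.argmin(scores))

print("index of lowest score:", most_anomalous)
print("its score:", scores[most_anomalous])
print("mean score of the cluster:", scores[:200].mean())
```

The distant point needs very few cuts to be separated from the rest, so it ends up with the lowest score, while points inside the cluster need many more cuts and score higher.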

What is the difference between Random Forest and Isolation Forest?

Isolation Forest is similar in principle to Random Forest and is built on the basis of decision trees. Isolation Forest, however, identifies anomalies or outliers rather than profiling normal data points.

Is Isolation Forest supervised or unsupervised?

It is important to mention that Isolation Forest is an unsupervised machine learning algorithm. That is, the model is fitted without labels: there is no pre-determined labeling of “outlier” or “not-outlier” in the dataset.

How many trees are in Isolation Forest?

There is no precise mathematical definition of how many trees make up a forest. For our purposes, the reader might think of 15 to 100 trees as a sensible size for an Isolation Forest; scikit-learn's IsolationForest builds 100 trees by default (n_estimators=100).


1 Answer

Do I need a separate dataset to train the model?

The short answer is "No". You fit the model and predict outliers on the same data.

IsolationForest is an unsupervised learning algorithm intended to detect outliers so that you can clean them out of your data (see the docs for more). In a typical machine learning setting, you would run it to clean your training dataset. As far as your toy example is concerned:

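import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

# behaviour="new" was only needed on scikit-learn 0.20/0.21 and has since been
# removed; exact labels may differ slightly between scikit-learn versions
clf = IsolationForest(max_samples=100, random_state=rng, contamination=0.1)

clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_train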
array([ 1,  1,  1, -1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1, -1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1, -1,  1,  1,  1,  1,  1, -1, -1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1, -1,  1,  1,  1,  1, -1, -1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

Here 1 marks inliers and -1 marks outliers. As specified by the contamination parameter, the fraction of points flagged as outliers is 0.1 (20 of the 200 samples).
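To see how contamination relates to the labels, here is a small sketch on the same toy data. It assumes a recent scikit-learn version; the names labels, decision and flagged_fraction are introduced only for this illustration, and an integer seed is used instead of the shared rng object.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

clf = IsolationForest(max_samples=100, random_state=42, contamination=0.1)
clf.fit(X_train)

labels = clf.predict(X_train)
# decision_function is shifted so that negative values are reported as outliers;
# the shift is chosen from the training scores according to contamination.
decision = clf.decision_function(X_train)

flagged_fraction = np.mean(labels == -1)
print("fraction flagged as outliers:", flagged_fraction)
print("labels agree with sign of decision_function:",
      np.array_equal(labels == -1, decision < 0))
```

In other words, contamination does not change how the trees are built; it only decides where the cut-off between 1 and -1 is placed on the anomaly scores.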

Finally, you would remove outliers like:

X_train_cleaned = X_train[y_pred_train == 1]
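The same steps should carry over to a dataset with 258 rows and 10 columns. The sketch below uses random numbers as a stand-in for the real data, since the actual dataset is not shown; X_own, clf_own and labels_own are hypothetical names, and it assumes every column is numeric with no missing values.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in for the real data (an assumption): 258 rows, 10 numeric columns.
# A pandas DataFrame with only numeric columns could be passed in the same way.
rng = np.random.RandomState(0)
X_own = rng.randn(258, 10)

clf_own = IsolationForest(n_estimators=100, contamination="auto", random_state=0)

# fit_predict fits the forest and labels the same rows in one call:
# 1 for inliers, -1 for outliers.
labels_own = clf_own.fit_predict(X_own)

# Keep only the rows labelled as inliers.
X_own_cleaned = X_own[labels_own == 1]

print("original shape:", X_own.shape)
print("cleaned shape:", X_own_cleaned.shape)
print("rows flagged as outliers:", int((labels_own == -1).sum()))
```

With contamination="auto" the number of flagged rows is decided by the algorithm's default threshold; passing a float such as 0.1 instead fixes the expected share of outliers up front.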
Sergey Bushmanov answered Oct 28 '22 01:10
