I am currently working on detecting outliers in my dataset using Isolation Forest in Python, and I do not completely understand the example and explanation given in the scikit-learn documentation.
Is it possible to use Isolation Forest to detect outliers in my dataset that has 258 rows and 10 columns?
Do I need a separate dataset to train the model? If yes, is it necessary to have that training dataset free from outliers?
This is my code:
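import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Test data drawn the same way, as in the scikit-learn example
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
clf = IsolationForest(max_samples=100, random_state=rng, contamination='auto')
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
print(len(y_pred_train))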
I tried loading my own dataset into X_train, but that does not seem to work.
In an Isolation Forest, randomly sub-sampled data is processed in a tree structure based on randomly selected features. The samples that travel deeper into the tree are less likely to be anomalies as they required more cuts to isolate them.
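The link between isolation depth and anomaly score can be seen with a small sketch. The toy cluster and the two probe points below are made up for illustration; score_samples is scikit-learn's negated anomaly score, so a lower value means the point was isolated after fewer cuts on average.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Hypothetical toy data: one tight cluster around the origin
X = 0.3 * rng.randn(200, 2)

clf = IsolationForest(n_estimators=100, random_state=0).fit(X)

# Lower score = shorter average path length = easier to isolate
points = np.array([[0.0, 0.0],   # centre of the cluster
                   [4.0, 4.0]])  # far away from every training point
scores = clf.score_samples(points)
print(scores)
```

The far-away point should receive the lower score of the two, which is what marks it as the more anomalous one.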
Isolation Forest is similar in principle to Random Forest and is built on the basis of decision trees. Isolation Forest, however, identifies anomalies or outliers rather than profiling normal data points.
It is important to mention that Isolation Forest is an unsupervised machine learning algorithm. That is, the trees are fitted without any pre-determined labeling of “outlier” or “not-outlier” in the dataset, so there is no “training” against known answers.
STEP 2: GENERATION OF AN ISOLATION FOREST
As the reader may expect, there is no precise mathematical definition as to how many trees make up a forest. For our purposes, the reader might think of 15 to 100 trees as a sensible size for an Isolation Forest.
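In scikit-learn the number of trees is set with the n_estimators parameter, which defaults to 100. A minimal sketch, using random placeholder data rather than any real dataset, of building a forest at each end of that range:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(258, 10)  # placeholder data, not a real dataset

# One forest at each end of the suggested range
small = IsolationForest(n_estimators=15, random_state=0).fit(X)
large = IsolationForest(n_estimators=100, random_state=0).fit(X)

# estimators_ holds the fitted trees
print(len(small.estimators_), len(large.estimators_))
```

More trees generally give more stable scores at the cost of longer fitting time; on a dataset this small the cost is negligible either way.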
Do I need a separate dataset to train the model?
The short answer is "No": you fit the model and predict outliers on the same data.
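For a dataset shaped like the one in the question (258 rows, 10 columns), fitting and predicting on the same array might look like the sketch below. The random array is only a stand-in for the real data, and the sketch assumes all ten columns are numeric.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Stand-in for the real data: 258 rows, 10 numeric columns
X = rng.randn(258, 10)

clf = IsolationForest(random_state=42, contamination='auto')
# fit_predict fits the forest and labels the same rows in one call
labels = clf.fit_predict(X)  # 1 = inlier, -1 = outlier

print(labels.shape)
print((labels == -1).sum(), "rows flagged as outliers")
```

With a pandas DataFrame, passing df.values (or the DataFrame itself) in place of X should work the same way.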
IsolationForest is an unsupervised learning algorithm that's intended to clean your data of outliers (see the docs for more). In the usual machine learning setting, you would run it to clean your training dataset. As far as your toy example is concerned:
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# behaviour="new" was only needed on older scikit-learn versions; the parameter has since been removed
clf = IsolationForest(max_samples=100, random_state=rng, contamination=.1)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_train
array([ 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1,
1, -1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, -1, 1, -1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, -1, 1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, -1, 1, 1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
where 1 represents inliers and -1 represents outliers. As specified by the contamination parameter, the fraction of outliers is 0.1.
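To see how contamination drives the share of flagged points, here is a small sketch on the same toy data. The three contamination values are arbitrary choices for illustration, and an integer random_state is used so that each run is reproducible.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

fractions = {}
for contamination in (0.05, 0.1, 0.2):
    clf = IsolationForest(max_samples=100, random_state=42,
                          contamination=contamination)
    y_pred = clf.fit_predict(X_train)
    # Share of training points labelled -1 (outlier)
    fractions[contamination] = np.mean(y_pred == -1)

print(fractions)
```

The flagged share should track the requested contamination closely, which means the parameter sets how many points are called outliers rather than discovering that number from the data.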
Finally, you would remove the outliers like this:
X_train_cleaned = X_train[y_pred_train == 1]