
Do I need to split the data for isolation forest?

I have a dataset of 10,049,972 rows x 19 columns. I used Isolation Forest to detect outliers, added an extra column that marks outliers as -1, dropped all rows flagged as -1, and then removed the column.

My question is: do I need to do a train/test/validation split for Isolation Forest to work? Also, can someone please confirm whether my code is valid?

Here is my code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import IsolationForest

df = pd.read_csv('D:\\Project\\database\\4-Final\\Final After.csv', low_memory=True)

iForest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42, max_samples=200)

iForest.fit(df.values.reshape(-1,1))
pred = iForest.predict(df.values.reshape(-1,1))
pred = df['anomaly']
df = df.drop(df['anomaly'==-1], inplace=True)

df.to_csv('D:\\Project\\database\\4-Final\\IF TEST.csv', index=False)

Thank you.

Asked Feb 13 '20 by Ali Youssef



1 Answer

My question is: do I need to do a train/test/validation split for Isolation Forest to work?

You want to detect outliers in just this one batch file, right? In that case, your solution may be OK, but in most cases you must split.

But please, try to understand when you would need to do the split. To explain this, let's walk through a real-world scenario.

Let's suppose you are trying to predict the anomalous behaviour of different engines. You build a model using the data available in your database up until "today", and start predicting on incoming data. The incoming data may not look like the data used for training, right? So how can you simulate this situation while you are configuring your model? By using a train/test/validation split and evaluating with the right metrics.

Edit: Let me add an example. I'll try to make it super simple.

If your engine database data is:

+----+-------------+--------------+
| id | engine_type | engine_value |
+----+-------------+--------------+
|  1 |           0 | 0.25         |
|  2 |           0 | 0.40         |
|  3 |           1 | 0.16         |
|  4 |           1 | 0.30         |
|  5 |           0 | 5.3          | <- anomaly
|  6 |           1 | 14.4         | <- anomaly
|  7 |           0 | 16.30        | <- anomaly
+----+-------------+--------------+

And you use all of it to train the model, the model will see the three anomalous values during training, right? The algorithm will build the forest using these 3 anomalous values, so it can be easier for the model to predict them.

Now, what would happen with this production data:

+----+-------------+--------------+
| id | engine_type | engine_value |
+----+-------------+--------------+
|  8 |           1 | 3.25         | <- anomaly
|  9 |           1 | 4.40         | <- anomaly
| 10 |           0 | 2.16         |
+----+-------------+--------------+

You pass it to your model, and it says these points are not anomalous but normal data, because it learned that your "threshold" is values bigger than 5.

This "threshold" is a product of the algorithm's hyperparameters; with a different configuration the model might have predicted these values as anomalous. But you are not testing the model's generalization.

So how can you improve this configuration? By splitting the data you have available at that moment. Instead of training on the whole database, you could train on only part of it and use the other part for testing. For example, use this part as training data:

+----+-------------+--------------+
| id | engine_type | engine_value |
+----+-------------+--------------+
|  1 |           0 | 0.25         |
|  2 |           0 | 0.40         |
|  3 |           1 | 0.16         |
|  4 |           1 | 0.30         |
|  7 |           0 | 16.30        | <- anomaly
+----+-------------+--------------+

And this as test data:

+----+-------------+--------------+
| id | engine_type | engine_value |
+----+-------------+--------------+
|  5 |           0 | 5.3          | <- anomaly
|  6 |           1 | 14.4         | <- anomaly
+----+-------------+--------------+

And then set a combination of hyperparameters that makes the algorithm predict the test data correctly. Does this guarantee that future predictions will be perfect? No, it does not, but it is not the same as just fitting the data without evaluating how well the model generalizes.
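The train/test idea above can be sketched with scikit-learn. This is a minimal example on made-up engine-like data (the sizes and value ranges are invented for illustration): hold out a test set, fit the forest on the training portion only, and measure how well the predictions match the held-out labels.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

# Made-up data: mostly small "engine values", plus a few large anomalies
rng = np.random.default_rng(42)
normal = rng.uniform(0.1, 0.5, size=(200, 1))
anomalies = rng.uniform(5.0, 20.0, size=(10, 1))
X = np.vstack([normal, anomalies])
y = np.concatenate([np.ones(200), -np.ones(10)])  # 1 = normal, -1 = anomaly

# Hold out a test set so we can measure generalization,
# instead of fitting and predicting on the same rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

iforest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
iforest.fit(X_train)

pred = iforest.predict(X_test)   # -1 = anomaly, 1 = normal
accuracy = (pred == y_test).mean()
print(f"test accuracy: {accuracy:.2f}")
```

In a real project you would tune the hyperparameters (contamination, n_estimators, max_samples) on a validation split rather than on the test set.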

Also can someone please confirm if my code is valid?

Mostly, but there are a couple of problems. The line pred = df['anomaly'] is reversed: it tries to read an 'anomaly' column that does not exist yet, instead of storing the predictions. The line df = df.drop(df['anomaly'==-1], inplace=True) sets df to None, because drop(..., inplace=True) returns nothing; use df = df[df['anomaly'] != -1] instead. Finally, df.values.reshape(-1,1) stacks all 19 columns into one long single-feature column, so the model sees one sample per cell rather than per row; assuming all your columns are numeric, pass the 2-D array directly. So change this:

iForest.fit(df.values.reshape(-1,1))
pred = iForest.predict(df.values.reshape(-1,1))
pred = df['anomaly']

To this:

df['anomaly'] = iForest.fit_predict(df.values)

Also, if you are using a recent pandas version, prefer:

df['anomaly'] = iForest.fit_predict(df.to_numpy())
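Putting it together, here is a minimal self-contained sketch of the whole pipeline; a tiny made-up DataFrame (with invented columns a and b) stands in for the real CSV. fit_predict adds the label column in one step, and boolean indexing drops the flagged rows without the inplace pitfall.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Small made-up frame standing in for the real CSV:
# four tightly clustered rows and one obvious outlier
df = pd.DataFrame({"a": [0.20, 0.30, 0.25, 9.5, 0.28],
                   "b": [1.10, 1.00, 1.20, 8.7, 1.05]})

iforest = IsolationForest(n_estimators=100, contamination=0.2, random_state=42)

# fit_predict on the full 2-D feature matrix -- no reshape needed
df["anomaly"] = iforest.fit_predict(df.to_numpy())

# Keep only rows predicted as normal (1), then drop the helper column
df = df[df["anomaly"] != -1].drop(columns="anomaly")
print(len(df))
```

From here, df.to_csv(...) writes out the cleaned data as in your original script.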
Answered Oct 11 '22 by Noki