Isolation Forest in Python

I am currently working on detecting outliers in my dataset using Isolation Forest in Python, and I did not completely understand the example and explanation given in the scikit-learn documentation.

Is it possible to use Isolation Forest to detect outliers in my dataset that has 258 rows and 10 columns?

Do I need a separate dataset to train the model? If yes, is it necessary to have that training dataset free from outliers?

This is my code:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
clf = IsolationForest(max_samples=100, random_state=rng, contamination='auto')
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
# y_pred_test = clf.predict(X_test)  # X_test is never defined in this snippet
print(len(y_pred_train))

I tried loading my own dataset into X_train, but that does not seem to work.

asked Feb 18 '19 06:02

Nnn

1 Answer

Do I need a separate dataset to train the model?

The short answer is "No". You train the model and predict outliers on the same data.

IsolationForest is an unsupervised learning algorithm intended to clean your data of outliers (see the docs for more). In a typical machine learning workflow, you would run it on your training dataset to clean it. As far as your toy example is concerned:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

# scikit-learn 0.20/0.21 also accepted behaviour="new" here;
# that argument was later deprecated and removed, so it is omitted.
clf = IsolationForest(max_samples=100, random_state=rng, contamination=.1)

clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_train
array([ 1,  1,  1, -1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1, -1,  1, -1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1, -1,  1,  1,  1,  1,  1, -1, -1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1, -1,  1,  1,  1,  1, -1, -1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

where 1 represents inliers and -1 represents outliers. As specified by the contamination parameter, the fraction of points flagged as outliers is 0.1 (20 of the 200 points above).
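If you want to check that fraction yourself, a minimal sketch along these lines should do it. It reuses the toy data from above; the variable names `labels`, `outlier_fraction`, `scores` and `most_abnormal` are mine and only illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

clf = IsolationForest(max_samples=100, random_state=rng, contamination=0.1)
labels = clf.fit(X_train).predict(X_train)

# Share of training points labelled -1; expected to be close to `contamination`.
outlier_fraction = np.mean(labels == -1)

# score_samples returns the anomaly score: the lower, the more abnormal.
scores = clf.score_samples(X_train)
most_abnormal = np.argsort(scores)[:5]  # indices of the five lowest-scoring rows

print(outlier_fraction)
print(most_abnormal)
```

Looking at the scores rather than only the -1/1 labels can help when you are unsure which contamination value to pick, since the labels are just a threshold applied to those scores.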

Finally, you can remove the outliers with a boolean mask:

X_train_cleaned = X_train[y_pred_train == 1]
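For a dataset shaped like the one in the question (258 rows, 10 columns), the same steps could look roughly like the sketch below. The random array is only a stand-in for your real data, and `contamination=0.05` is an assumed guess at the outlier share, not a recommended value.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in for the real dataset: 258 rows, 10 numeric columns (hypothetical values).
rng = np.random.RandomState(0)
X = rng.randn(258, 10)

# contamination=0.05 is an assumed guess at the outlier share; tune it for your data.
clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = clf.fit_predict(X)  # 1 = inlier, -1 = outlier

X_clean = X[labels == 1]
X_outliers = X[labels == -1]
print(X_clean.shape, X_outliers.shape)
```

fit_predict fits the model and returns the labels for the same rows in one call, so no separate training set is involved. If your data is in a pandas DataFrame, passing its numeric columns in place of X should work the same way.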

answered Oct 28 '22 01:10

Sergey Bushmanov

Tags: python-3.x, outliers, scikit-learn, anomaly-detection
