I want to use IsolationForest for finding outliers. I want to find the best parameters for the model with GridSearchCV. The problem is that I always get the same error:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator IsolationForest(behaviour='old', bootstrap=False, contamination='legacy',
max_features=1.0, max_samples='auto', n_estimators=100,
n_jobs=None, random_state=None, verbose=0, warm_start=False) does not.
It seems the problem is that IsolationForest does not have a score method.
Is there a way to fix this?
Also, is there a way to compute a score for an isolation forest?
This is my code:
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
df = pd.DataFrame({'first': [-112,0,1,28,5,6,3,5,4,2,7,5,1,3,2,2,5,2,42,84,13,43,13],
'second': [42,1,2,85,2,4,6,8,3,5,7,3,64,1,4,1,2,4,13,1,0,40,9],
'third': [3,4,7,74,3,8,2,4,7,1,53,6,5,5,59,0,5,12,65,4,3,4,11],
'result': [5,2,3,0.04,3,4,3,125,6,6,0.8,9,1,4,59,12,1,4,0,8,5,4,1]})
x = df.iloc[:,:-1]
tuned = {'n_estimators':[70,80,100,120,150,200], 'max_samples':['auto', 1,3,5,7,10],
'contamination':['legacy', 'auto'], 'max_features':[1,2,3,4,5,6,7,8,9,10,13,15],
'bootstrap':[True,False], 'n_jobs':[None,1,2,3,4,5,6,7,8,10,15,20,25,30], 'behaviour':['old', 'new'],
'random_state':[None,1,5,10,42], 'verbose':[0,1,2,3,4,5,6,7,8,9,10], 'warm_start':[True,False]}
isolation_forest = GridSearchCV(IsolationForest(), tuned)
model = isolation_forest.fit(x)
list_of_val = [[1,35,3], [3,4,5], [1,4,66], [4,6,1], [135,5,0]]
df['outliers'] = model.predict(x)
df['outliers'] = df['outliers'].map({-1: 'outlier', 1: 'good'})
print(model.best_params_)
print(df)
Outlier detection: the training data contains outliers, and outlier detection estimators try to fit the regions where the training data is most concentrated, ignoring the deviant observations.
Novelty detection: the training data is not polluted by outliers, and we are interested in detecting whether a new observation is an outlier.
Normalization is unnecessary: assuming linear scaling, it will neither hurt nor help, because isolation forests work by splitting the data on a random feature at a random point between that feature's min and max.
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the scores of the samples. If 'auto', the threshold is determined as in the original paper. If float, the contamination should be in the range (0, 0.5].
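To make the contamination description concrete, here is a minimal sketch comparing 'auto' with a float value. The synthetic data, the seed, and the variable names are assumptions chosen for illustration; only IsolationForest, its contamination parameter, and the fitted offset_ attribute come from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic 2-D data, purely illustrative (not from the question).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))

# 'auto' uses the fixed threshold from the original paper.
auto_forest = IsolationForest(contamination="auto", random_state=0).fit(X)

# A float asks for roughly that fraction of the training data to be flagged.
fixed_forest = IsolationForest(contamination=0.1, random_state=0).fit(X)

n_flagged_auto = int((auto_forest.predict(X) == -1).sum())
n_flagged_fixed = int((fixed_forest.predict(X) == -1).sum())

print("offset_ with 'auto':", auto_forest.offset_)
print("flagged with 'auto':", n_flagged_auto)
print("flagged with 0.1:  ", n_flagged_fixed)
```

With a float, the number of flagged training samples should land close to contamination times the sample count, whereas 'auto' flags however many samples fall below the fixed threshold.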
The Local Outlier Factor (LOF) algorithm is an unsupervised anomaly detection method which computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors.
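As a rough illustration of the LOF description, the sketch below fits scikit-learn's LocalOutlierFactor on a small synthetic cluster plus one distant point. The data, the seed, and n_neighbors=5 are assumptions made for this example, not values taken from the question.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# One tight synthetic cluster plus a single far-away point (illustrative data).
rng = np.random.RandomState(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(30, 2))
X = np.vstack([cluster, [[50.0, 50.0]]])

# fit_predict labels each training sample: -1 for outliers, 1 for inliers.
lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X)

# negative_outlier_factor_ is the negated LOF score; values far below -1
# indicate a much lower local density than the neighbours have.
print("label of the distant point:", labels[-1])
print("its negative outlier factor:", lof.negative_outlier_factor_[-1])
```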
You need to create your own scoring function, since IsolationForest does not have a built-in score method. Instead, you can use the score_samples function that IsolationForest provides (it can be considered a proxy for score), create your own scorer as described here, and pass it to GridSearchCV. I have modified your code to do this:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
df = pd.DataFrame({'first': [-112,0,1,28,5,6,3,5,4,2,7,5,1,3,2,2,5,2,42,84,13,43,13],
'second': [42,1,2,85,2,4,6,8,3,5,7,3,64,1,4,1,2,4,13,1,0,40,9],
'third': [3,4,7,74,3,8,2,4,7,1,53,6,5,5,59,0,5,12,65,4,3,4,11],
'result': [5,2,3,0.04,3,4,3,125,6,6,0.8,9,1,4,59,12,1,4,0,8,5,4,1]})
x = df.iloc[:,:-1]
tuned = {'n_estimators':[70,80], 'max_samples':['auto'],
'contamination':['legacy'], 'max_features':[1],
'bootstrap':[True], 'n_jobs':[None,1,2], 'behaviour':['old'],
'random_state':[None,1,], 'verbose':[0,1,2], 'warm_start':[True]}
def scorer_f(estimator, X):  # your own scorer
    return np.mean(estimator.score_samples(X))
# or you could use a lambda expression as shown below
# scorer = lambda est, data: np.mean(est.score_samples(data))
isolation_forest = GridSearchCV(IsolationForest(), tuned, scoring=scorer_f)
model = isolation_forest.fit(x)
SAMPLE OUTPUT
print(model.best_params_)
{'behaviour': 'old',
'bootstrap': True,
'contamination': 'legacy',
'max_features': 1,
'max_samples': 'auto',
'n_estimators': 70,
'n_jobs': None,
'random_state': None,
'verbose': 1,
'warm_start': True}
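For readers on a recent scikit-learn release, here is a minimal sketch of the same idea. It assumes that the behaviour parameter and the 'legacy' contamination value are no longer accepted (both were deprecated and later removed), so they are left out of the grid. The name mean_anomaly_score is hypothetical, and cv=3 and random_state=0 are illustrative choices rather than part of the code above. Keep in mind that the mean of score_samples is only a proxy: it ranks parameter settings by how normal the held-out rows look on average, which is not the same as measuring detection accuracy.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV

# The three feature columns from the question's DataFrame.
X = pd.DataFrame({
    'first': [-112, 0, 1, 28, 5, 6, 3, 5, 4, 2, 7, 5, 1, 3, 2, 2, 5, 2, 42, 84, 13, 43, 13],
    'second': [42, 1, 2, 85, 2, 4, 6, 8, 3, 5, 7, 3, 64, 1, 4, 1, 2, 4, 13, 1, 0, 40, 9],
    'third': [3, 4, 7, 74, 3, 8, 2, 4, 7, 1, 53, 6, 5, 5, 59, 0, 5, 12, 65, 4, 3, 4, 11],
})

# Reduced grid without the removed 'behaviour' / 'legacy' options (assumption).
param_grid = {
    'n_estimators': [70, 80],
    'max_samples': ['auto'],
    'max_features': [1],
    'bootstrap': [True],
}

def mean_anomaly_score(estimator, X, y=None):
    """Hypothetical scorer: average score_samples on the held-out fold."""
    return float(np.mean(estimator.score_samples(X)))

search = GridSearchCV(
    IsolationForest(random_state=0),  # fixed seed only for repeatability
    param_grid,
    scoring=mean_anomaly_score,
    cv=3,
)
search.fit(X)

print(search.best_params_)
labels = search.predict(X)  # -1 for outliers, 1 for inliers
```

Parameters such as n_jobs and verbose do not change what the model learns, so leaving them out of the grid only shortens the search.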
Hope this helps!