Passing Target/Label data to Scikit-learn GridSearchCV's fit method for OneClassSVM

From my understanding, One-Class SVMs are trained without target/label data.

One answer at Use of OneClassSVM with GridSearchCV suggests passing Target/Label data to GridSearchCV's fit method when the classifier is the OneClassSVM.

How does the GridSearchCV method handle this data?

Does it actually train the OneClassSVM without the Target/label data, and just use the Target/label data for evaluation?

I tried following the GridSearchCV source code, but I couldn't find the answer.

asked Oct 01 '19 by user3731622


People also ask

What does the fit() method in scikit-learn do?

The fit() method takes the training data as arguments, which can be one array in the case of unsupervised learning, or two arrays in the case of supervised learning.
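
For illustration, here is a minimal sketch of both calling conventions (the estimators and dataset are arbitrary choices of mine, not part of the original snippet):

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
KMeans(n_clusters=3, n_init=10).fit(X)        # unsupervised: one array, features only
LogisticRegression(max_iter=1000).fit(X, y)   # supervised: two arrays, features and labels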

What are the two parameters passed to the fit method?

The fit method takes two parameters: X and y. X contains your data samples, where each row is one datapoint (an N-dimensional feature vector). y contains the labels, one per datapoint.
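
As a tiny shape sketch (the numbers and the classifier are made up for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # shape (3, 2): 3 samples, 2 features each
y = np.array([0, 1, 0])                              # shape (3,): one label per sample
KNeighborsClassifier(n_neighbors=1).fit(X, y)        # X: (n_samples, n_features), y: (n_samples,)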

What is clf.fit()?

A classifier object, conventionally named clf, is created for the model. The fit method fits the training dataset, passed as features (data) and labels (target), to the Naive Bayes model. The predict method then predicts labels for the test dataset using the fitted (trained) model.
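
A small self-contained sketch of that clf.fit / clf.predict pattern (the dataset and train/test split are my own choices for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = GaussianNB()
clf.fit(X_train, y_train)    # fit on features (data) and labels (target)
print(clf.predict(X_test))   # predict labels for the held-out test data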

What method does scikit-learn use to find the best classification hypothesis for the training data?

Linear discriminant analysis, as you may be able to guess, is a linear classification algorithm and is best used when the data has a linear relationship.


1 Answer

Does it actually train the OneClassSVM without the Target/label data, and just use the Target/label data for evaluation?

Yes to both.

GridSearchCV does actually pass the labels to OneClassSVM in the fit call, but OneClassSVM simply ignores them: in its fit implementation, an array of ones is sent to the underlying SVM trainer instead of the given label array y. Parameters like y in fit exist only so that meta-estimators like GridSearchCV can work in a consistent way without having to distinguish between supervised and unsupervised estimators.
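
You can check this ignore-y behaviour directly; in the small sketch below (my addition, not part of the original answer), fitting with and without labels gives identical predictions:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import OneClassSVM

X, y = load_iris(return_X_y=True)
a = OneClassSVM(gamma='scale').fit(X)              # no labels passed
b = OneClassSVM(gamma='scale').fit(X, y)           # labels passed, silently ignored
print(np.array_equal(a.predict(X), b.predict(X)))  # True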

To actually test this, let's first detect outliers using GridSearchCV:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import OneClassSVM
from sklearn.metrics import f1_score, make_scorer

X, y = load_iris(return_X_y=True)
yd = np.where(y == 0, -1, 1)  # OneClassSVM convention: -1 = outlier, +1 = inlier
cv = KFold(n_splits=4, random_state=42, shuffle=True)
model = GridSearchCV(OneClassSVM(), {'gamma': ['scale']}, cv=cv, scoring=make_scorer(f1_score))
model = model.fit(X, yd)
print(model.cv_results_)

Note all the splitN_test_score entries (split0_test_score, split1_test_score, and so on) in cv_results_.
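
For example, the per-split scores can be pulled out like this (the key names follow scikit-learn's standard cv_results_ naming; this snippet is mine, not from the original answer):

split_scores = [model.cv_results_[f'split{i}_test_score'] for i in range(cv.get_n_splits())]
print(split_scores)  # one array per fold, each with one entry per parameter candidate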

Now let's do it manually, without sending the labels yd during the fit call:

for train, test in cv.split(X, yd):
    clf = OneClassSVM(gamma='scale').fit(X[train])  # fit on features only, no labels
    print(f1_score(yd[test], clf.predict(X[test])))

Both should yield exactly the same scores.
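
If you want to verify that claim programmatically, something along these lines should print True (this comparison is my addition, not part of the original answer):

manual = [f1_score(yd[test], OneClassSVM(gamma='scale').fit(X[train]).predict(X[test]))
          for train, test in cv.split(X, yd)]
grid = [model.cv_results_[f'split{i}_test_score'][0] for i in range(cv.get_n_splits())]
print(np.allclose(manual, grid))  # True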

answered Nov 09 '22 by Shihab Shahriar Khan