I am trying to learn how to find the best parameters for a classifier, so I am using GridSearchCV on a multi-class classification problem. I took the dummy code from the question "Does not GridSearchCV support multi-class?" and am running it with n_classes=3.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler,label_binarize
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.1, 0.9, 0.3],n_classes=3, n_clusters_per_class=1,n_informative=2)
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='balanced'))  # 'balanced' replaces the removed 'auto'
param_space = dict(svc__C=np.logspace(-5,0,5), svc__gamma=np.logspace(-2, 2, 10))
f1_score
my_scorer = make_scorer(f1_score, greater_is_better=True)
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
I am trying to one-hot encode the labels, as advised in the question Scikit-learn GridSearch giving "ValueError: multiclass format is not supported" error. Also, some datasets, such as the Toxic Comment Classification dataset on Kaggle, come with labels that are already binarized.
y = label_binarize(y, classes=[0, 1, 2])
for i in classes:
    gscv.fit(X, y[i])
    print(gscv.best_params_)
I am getting:
ValueError: bad input shape (2000L, 3L)
I am not sure why I am getting this error. My objective is to find the best parameters for a multi-class classification problem.
There are two problems in the two parts of your code.
1) Let's start with the first part, where the labels are not one-hot encoded. SVC supports multi-class targets just fine, but f1_score with its default settings, used as the scorer inside GridSearchCV, does not.
By default (average='binary'), f1_score reports the score of the positive label only, which is defined only for binary classification, so it raises an error in your case.
It can also return an array of scores, one per class (average=None), but GridSearchCV accepts only a single number as the score, because it needs one value to rank the hyper-parameter combinations and pick the best. So you need to pass an averaging method to f1_score to reduce that array to a single value.
According to the f1_score documentation, the following averaging methods are allowed:
average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]
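To see what these settings do in practice, here is a minimal sketch on made-up toy labels (y_true and y_pred are hypothetical, not taken from the question's data). It compares the per-class array from average=None with the single numbers produced by the averaging methods, and shows that the default setting is rejected for multi-class targets.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical toy labels for a 3-class problem (illustration only).
y_true = np.array([0, 1, 2, 0, 1, 2])
y_pred = np.array([0, 2, 1, 0, 0, 1])

# average=None returns one F1 score per class: an array of length 3,
# which GridSearchCV cannot use to rank parameter combinations.
per_class = f1_score(y_true, y_pred, average=None)

# An averaging method collapses the per-class scores into a single number.
micro = f1_score(y_true, y_pred, average='micro')
macro = f1_score(y_true, y_pred, average='macro')
weighted = f1_score(y_true, y_pred, average='weighted')

# The default average='binary' raises a ValueError for multi-class targets.
try:
    f1_score(y_true, y_pred)
    binary_error = None
except ValueError as exc:
    binary_error = str(exc)

print(per_class, micro, macro, weighted)
print(binary_error)
```

For single-label multi-class data like this, 'micro' is equivalent to plain accuracy, while 'macro' is the unweighted mean of the per-class scores and so gives rare classes the same influence as common ones.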
So change your make_scorer like this:
my_scorer = make_scorer(f1_score, greater_is_better=True, average='micro')
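Set the 'average' parameter to whichever method suits your problem; for example, 'micro' counts every sample equally, while 'macro' gives every class equal weight.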
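Putting the fix together, the following is a sketch of the whole search, not a drop-in replacement for your script. It assumes a recent scikit-learn (GridSearchCV imported from sklearn.model_selection, class_weight='balanced'), and the smaller dataset, the reduced grid, cv=3 and random_state=0 are my own choices to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic 3-class dataset (sizes are assumptions chosen for speed).
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_classes=3, n_clusters_per_class=1,
                           random_state=0)

# Keep y as a 1-d array of class labels; SVC handles multi-class itself.
pipe = make_pipeline(StandardScaler(),
                     SVC(kernel='rbf', class_weight='balanced'))

# Reduced grid (an assumption); widen it for a real search.
param_space = {
    'svc__C': np.logspace(-2, 1, 3),
    'svc__gamma': np.logspace(-2, 1, 3),
}

# The averaging method makes f1_score return a single number per fit.
my_scorer = make_scorer(f1_score, greater_is_better=True, average='micro')

gscv = GridSearchCV(pipe, param_space, scoring=my_scorer, cv=3)
gscv.fit(X, y)

best_params = gscv.best_params_
best_score = gscv.best_score_
print(best_params, best_score)
```

If you do not need a custom scorer object, passing scoring='f1_micro' (or 'f1_macro', 'f1_weighted') to GridSearchCV should give the same behaviour with less code.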
2) Now the second part: when you one-hot encode the labels, y becomes a 2-d array, but SVC only supports a 1-d array as y, as specified in the documentation:
fit(X, y, sample_weight=None)
X : {array-like, sparse matrix}, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
But even if you encode the labels and use a classifier that supports 2-d labels, the first error will still have to be solved. So I would advise you not to one-hot encode the labels and just change the f1_score scorer.
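If the labels arrive already one-hot encoded, one possible way back to the 1-d form that SVC expects is to take the column index of the 1 in each row. This is only a sketch on a hypothetical toy label vector, and it assumes each row has exactly one positive class; it does not apply to multi-label data, where a row can contain several 1s.

```python
import numpy as np
from sklearn.preprocessing import label_binarize

# Hypothetical 1-d class labels for a 3-class problem.
y = np.array([0, 2, 1, 2, 0])

# One-hot encoding turns y into a 2-d array of shape (n_samples, n_classes),
# which SVC.fit does not accept.
y_onehot = label_binarize(y, classes=[0, 1, 2])

# With exactly one 1 per row, argmax over the columns recovers the
# original 1-d class indices.
y_back = y_onehot.argmax(axis=1)

print(y_onehot.shape, y_back)
```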