Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

GridSearch for Multi-label classification in Scikit-learn

I am trying to do GridSearch for best hyper-parameters in every individual one of ten folds cross validation, it worked fine with my previous multi-class classification work, but not the case this time with multi-label work.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = OneVsRestClassifier(LinearSVC())

C_range = 10.0 ** np.arange(-2, 9)
param_grid = dict(estimator__clf__C = C_range)

clf = GridSearchCV(clf, param_grid)
clf.fit(X_train, y_train)

I am getting the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-65-dcf9c1d2e19d> in <module>()
      6 
      7 clf = GridSearchCV(clf, param_grid)
----> 8 clf.fit(X_train, y_train)

/usr/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
    595 
    596         """
--> 597         return self._fit(X, y, ParameterGrid(self.param_grid))
    598 
    599 

/usr/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y,   
parameter_iterable)
    357                                  % (len(y), n_samples))
    358             y = np.asarray(y)
--> 359         cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
    360 
    361         if self.verbose > 0:

/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _check_cv(cv, X,  
y, classifier, warn_mask)
   1365             needs_indices = None
   1366         if classifier:
-> 1367             cv = StratifiedKFold(y, cv, indices=needs_indices)
   1368         else:
   1369             if not is_sparse:

/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, 
y, n_folds, indices, shuffle, random_state)
    427         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
    428             for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 429                 label_test_folds = test_folds[y == label]
    430                 # the test split can be too big because we used
    431                 # KFold(max(c, self.n_folds), self.n_folds) instead of

ValueError: boolean index array should have 1 dimension

Which might refer to the dimension or the format of the label indicator.

print X_train.shape, y_train.shape

get:

(147, 1024) (147, 6)

Seems GridSearch implements StratifiedKFold inherently. The problem raises in the stratified K-fold strategy with multi-label problem.

StratifiedKFold(y_train, 10)

gives

ValueError                                Traceback (most recent call last)
<ipython-input-87-884ffeeef781> in <module>()
----> 1 StratifiedKFold(y_train, 10)

/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self,   
y, n_folds, indices, shuffle, random_state)
    427         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
    428             for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 429                 label_test_folds = test_folds[y == label]
    430                 # the test split can be too big because we used
    431                 # KFold(max(c, self.n_folds), self.n_folds) instead of

ValueError: boolean index array should have 1 dimension

Current use of conventional K-fold strategy works fine. Is there any method to implement stratified K-fold to multi-label classification?

like image 605
Francis Avatar asked Mar 18 '23 09:03

Francis


1 Answers

Grid search performs stratified cross-validation for classification problems, but for multi-label tasks this is not implemented; in fact, multi-label stratification is an unsolved problem in machine learning. I recently faced the same issue, and all the literature that I could find was a proposed method in this article (the authors of which state that they couldn't find any other attempts at solving this either).

like image 112
Fred Foo Avatar answered Apr 26 '23 23:04

Fred Foo