I am trying to run a GridSearch for the best hyper-parameters within each individual fold of a ten-fold cross-validation. This worked fine for my previous multi-class classification work, but it fails this time on a multi-label problem.
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = OneVsRestClassifier(LinearSVC())
C_range = 10.0 ** np.arange(-2, 9)
param_grid = dict(estimator__clf__C=C_range)
clf = GridSearchCV(clf, param_grid)
clf.fit(X_train, y_train)
I am getting the error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-65-dcf9c1d2e19d> in <module>()
      6 
      7 clf = GridSearchCV(clf, param_grid)
----> 8 clf.fit(X_train, y_train)
/usr/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
    595 
    596         """
--> 597         return self._fit(X, y, ParameterGrid(self.param_grid))
    598 
    599 
/usr/local/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
    357                                  % (len(y), n_samples))
    358             y = np.asarray(y)
--> 359         cv = check_cv(cv, X, y, classifier=is_classifier(estimator))
    360 
    361         if self.verbose > 0:
/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in _check_cv(cv, X, y, classifier, warn_mask)
   1365             needs_indices = None
   1366         if classifier:
-> 1367             cv = StratifiedKFold(y, cv, indices=needs_indices)
   1368         else:
   1369             if not is_sparse:
/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, y, n_folds, indices, shuffle, random_state)
    427         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
    428             for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 429                 label_test_folds = test_folds[y == label]
    430                 # the test split can be too big because we used
    431                 # KFold(max(c, self.n_folds), self.n_folds) instead of
ValueError: boolean index array should have 1 dimension
This seems to point to the dimensions or the format of the label indicator matrix.
print X_train.shape, y_train.shape
gives:
(147, 1024) (147, 6)
It seems that GridSearchCV uses StratifiedKFold internally whenever the estimator is a classifier. The problem arises when that stratified K-fold strategy is applied to a multi-label target:
StratifiedKFold(y_train, 10)
gives
ValueError                                Traceback (most recent call last)
<ipython-input-87-884ffeeef781> in <module>()
----> 1 StratifiedKFold(y_train, 10)
/usr/local/lib/python2.7/site-packages/sklearn/cross_validation.pyc in __init__(self, y, n_folds, indices, shuffle, random_state)
    427         for test_fold_idx, per_label_splits in enumerate(zip(*per_label_cvs)):
    428             for label, (_, test_split) in zip(unique_labels, per_label_splits):
--> 429                 label_test_folds = test_folds[y == label]
    430                 # the test split can be too big because we used
    431                 # KFold(max(c, self.n_folds), self.n_folds) instead of
ValueError: boolean index array should have 1 dimension
Using a conventional (non-stratified) K-fold strategy works fine, as in the sketch below. Is there any way to apply stratified K-fold to a multi-label classification problem?
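For completeness, this is roughly what the working non-stratified run looks like; a minimal sketch against the older cross_validation/grid_search API shown in the traceback (newer scikit-learn versions would use sklearn.model_selection.KFold(n_splits=10) instead):

from sklearn.cross_validation import KFold

# An explicit, non-stratified 10-fold split; passing it as cv stops GridSearchCV
# from falling back to StratifiedKFold for classifiers.
kf = KFold(len(y_train), n_folds=10)
clf = GridSearchCV(OneVsRestClassifier(LinearSVC()), param_grid, cv=kf)
clf.fit(X_train, y_train)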
Grid search performs stratified cross-validation for classification problems, but this is not implemented for multi-label tasks; in fact, multi-label stratification is still an open problem in machine learning. I recently faced the same issue, and all the literature I could find was a method proposed in this article (whose authors state that they could not find any other attempts at solving it either).
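As a practical note (this is my own suggestion, not something taken from the article): the scikit-multilearn package provides an IterativeStratification splitter that implements iterative stratification for multi-label data and can be passed to a recent scikit-learn GridSearchCV as a CV object. A rough sketch, assuming the package is installed and a scikit-learn version with model_selection (check the exact constructor arguments against your installed versions):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from skmultilearn.model_selection import IterativeStratification

# Multi-label-aware stratified K-fold: it tries to spread each label
# (and, with order=2, each label pair) evenly across the folds.
stratifier = IterativeStratification(n_splits=10, order=1)

# Without a Pipeline, the C of the wrapped LinearSVC is reached via estimator__C.
param_grid = {'estimator__C': 10.0 ** np.arange(-2, 9)}

clf = GridSearchCV(OneVsRestClassifier(LinearSVC()), param_grid, cv=stratifier)
clf.fit(X_train, y_train)

The splitter exposes the usual split(X, y) interface, so it can also be used directly in a manual per-fold loop if you prefer to run the grid search yourself inside each fold.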