 

Not able to use Stratified K-Fold on a multi-label classifier

The following code is used to do K-Fold validation, but I am unable to train the model because it throws this error:

ValueError: Error when checking target: expected dense_14 to have shape (7,) but got array with shape (1,)

My target variable has 7 classes, and I am using LabelEncoder to encode the classes into numbers.

Seeing this error, I changed to MultiLabelBinarizer to encode the classes, and now I get the following error:

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.

The following is the code for the K-Fold validation:

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True)
scores = np.zeros(10)
for index, (train_indices, val_indices) in enumerate(skf.split(X, y)):
    print("Training on fold " + str(index + 1) + "/10...")
    # Generate batches from indices
    xtrain, xval = X[train_indices], X[val_indices]
    ytrain, yval = y[train_indices], y[val_indices]
    model = load_model()  # defined above
    scores[index] = train_model(model, xtrain, ytrain, xval, yval)
print(scores)
print(scores.mean())

I don't know what to do. I want to use Stratified K-Fold on my model. Please help me.

asked Feb 26 '19 by Sai Pavan


People also ask

Why is stratified k-fold cross-validation better than k-fold cross-validation?

KFold is a cross-validator that divides the dataset into k folds. Stratified means that each fold has the same proportion of observations with a given label as the dataset as a whole.
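A minimal sketch of the difference (the skewed toy labels below are invented for illustration): with an imbalanced class, plain KFold can produce folds whose class mix drifts from the full dataset, while StratifiedKFold keeps it fixed.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced data: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

for name, cv in [("KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0))]:
    # Fraction of class 1 in each validation fold
    print(name, [round(y[val].mean(), 2) for _, val in cv.split(X, y)])
# StratifiedKFold keeps every fold at exactly 10% class 1; KFold may not.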

How do you use stratified cross-validation?

In machine learning, when we want to train a model we split the dataset into a training set and a test set, typically using the train_test_split() function from sklearn, then train the model on the training set and evaluate it on the test set. Stratified splitting additionally preserves the class proportions in both sets.
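For a single split, the same idea is available through train_test_split's stratify argument; a small sketch with made-up data:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)  # 80/20 class ratio, chosen for illustration

# stratify=y preserves the 80/20 class ratio in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
print(y_train.mean(), y_test.mean())  # both ~0.2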

What is stratified cross-validation and when should we use it?

Implementing the concept of stratified sampling in cross-validation ensures the training and test sets have the same proportion of the feature of interest as the original dataset. Doing this with the target variable ensures that the cross-validation result is a close approximation of the generalization error.

How do you stratify multi-label data?

The method follows the approaches outlined in the Sechidis (2011) and Szymański (2017) papers on stratifying multi-label data. In general, what we expect from a given stratification output is that each stratum, or fold, is close to a given demanded size, usually equal to 1/k in the k-fold approach, or to an x% train-to-test division in a two-fold split.
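Those papers are implemented as IterativeStratification in the scikit-multilearn package; a hedged sketch, assuming scikit-multilearn is installed and using random indicator labels purely for illustration:

import numpy as np
from skmultilearn.model_selection import IterativeStratification

np.random.seed(0)
X = np.random.rand(100, 5)
Y = np.random.randint(0, 2, (100, 4))  # multi-label indicator matrix

# order=1 balances the proportion of each single label across folds;
# order=2 would also balance label pairs
k_fold = IterativeStratification(n_splits=5, order=1)
for train_idx, test_idx in k_fold.split(X, Y):
    print(len(train_idx), len(test_idx), Y[test_idx].mean(axis=0).round(2))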

How do you use StratifiedKFold for multi-label classification?

You can reframe the problem as simple multi-class classification instead of multi-label classification by encoding each label combination as a single class. You can then use StratifiedKFold directly with y_new as your target, and map your labels back once the splits are done.

What is stratified k-fold cross-validator?

Stratified K-Folds cross-validator. Provides train/test indices to split data in train/test sets. This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.

Are all the folds stratified?

It seems OK so far: every fold contains a stratified sample, len(df_folds[df_folds['fold'] == fold_number].index) gives the expected fold size, and no two folds intersect, which can be verified with set(A).intersection(B) where A and B are the index values (image_id) of two folds.
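A quick way to run that disjointness check is to collect each fold's validation indices and assert that every pairwise intersection is empty; a sketch with illustrative data:

from itertools import combinations

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(50).reshape(-1, 1)
y = np.array([0, 1] * 25)

folds = [set(val) for _, val in StratifiedKFold(n_splits=5).split(X, y)]
for (i, a), (j, b) in combinations(enumerate(folds), 2):
    assert not a & b, "folds {} and {} overlap".format(i, j)
print("all validation folds are pairwise disjoint")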


1 Answer

MultiLabelBinarizer returns an indicator vector whose length equals the number of classes.

If you look at how StratifiedKFold splits your dataset, you will see that it only accepts a one-dimensional target variable, whereas you are trying to pass a target of shape [n_samples, n_classes].

A stratified split basically preserves your class distribution, and if you think about it, per-class stratification does not make a lot of sense for a multi-label classification problem.

If you want to preserve the distribution in terms of the different combinations of classes in your target variable, then the answer here explains two ways in which you can define your own stratified split function.

UPDATE:

The logic is something like this:

Assume you have n classes and your target variable is a combination of these n classes; then there are 2^n - 1 possible combinations (not including all zeros). You can now create a new target variable by treating each combination as a new label.

For example, if n=3, you will have 7 unique combinations:

 1. [1, 0, 0]
 2. [0, 1, 0]
 3. [0, 0, 1]
 4. [1, 1, 0]
 5. [1, 0, 1]
 6. [0, 1, 1]
 7. [1, 1, 1]

Map all your labels to this new target variable. You can now look at your problem as simple multi-class classification, instead of multi-label classification.

Now you can directly use StratifiedKFold with y_new as your target. Once the splits are done, you can map your labels back.

Code sample:

import numpy as np

np.random.seed(1)
y = np.random.randint(0, 2, (10, 7))
y = y[y.sum(axis=1) != 0]  # drop rows with no labels at all

OUTPUT:

array([[1, 1, 0, 0, 1, 1, 1],
       [1, 1, 0, 0, 1, 0, 1],
       [1, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 1],
       [1, 1, 0, 0, 0, 1, 1],
       [1, 1, 1, 1, 0, 1, 1],
       [0, 0, 1, 0, 0, 1, 1],
       [1, 0, 1, 0, 0, 1, 1],
       [0, 1, 1, 1, 1, 0, 0]])

Label encode your class vectors:

from sklearn.preprocessing import LabelEncoder

def get_new_labels(y):
    # Represent each row (label combination) as a string, then encode it as one class
    y_new = LabelEncoder().fit_transform([''.join(map(str, row)) for row in y])
    return y_new

y_new = get_new_labels(y)

OUTPUT:

array([7, 6, 3, 3, 2, 5, 8, 0, 4, 1])
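Putting it together, a sketch of the full loop on a larger made-up dataset (in practice every label combination needs at least n_splits samples, otherwise StratifiedKFold will complain): split on y_new for stratification, but index the original multi-label matrix y for training.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder

np.random.seed(1)
X = np.random.rand(500, 20)            # hypothetical feature matrix
y = np.random.randint(0, 2, (500, 3))  # 3 labels -> up to 7 non-zero combinations
keep = y.sum(axis=1) != 0
X, y = X[keep], y[keep]

# Collapse each label combination into a single multi-class target
y_new = LabelEncoder().fit_transform([''.join(map(str, row)) for row in y])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y_new)):
    # Stratify on the combination labels, but train on the original rows of y
    xtrain, xval = X[train_idx], X[val_idx]
    ytrain, yval = y[train_idx], y[val_idx]
    print("fold {}: train={}, val={}".format(fold, len(train_idx), len(val_idx)))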
answered Oct 10 '22 by panktijk