
How to perform MultiLabel stratified sampling?

I'm dealing with multi-labelled data and I would like to use stratified sampling. Let's assume I have 10 labels and let's call them 'ABCDEFGHIJ'. I have a dataframe with 10 columns, one per label, alongside the rest of the information about the entries. I can extract those 10 columns into an n_entries x 10 matrix that I will refer to as label_values.

For instance, a row of label_values looks like [0,0,1,1,0,0,0,0,0,0], and this specific row means that the entry has label C and label D.

I would like to split my data into a training and a validation set, and I would like to have the same proportion of each label in training and validation. To perform the split, I was using sklearn's train_test_split function (before I needed to stratify), which happens to have a stratify argument. Its current behaviour is to turn the multi-label problem into a multiclass one: a label set like [A,B] is considered a brand-new class, totally different from class A and class B. As a result, some of these combined classes have only 1 element, which triggers an error:

ValueError("The least populated class in y has only 1"
                         " member, which is too few. The minimum"
                         " number of groups for any class cannot"
                         " be less than 2.")

raised in the _iter_indices method of the StratifiedShuffleSplit class in sklearn/model_selection/_split.py:

if np.min(class_counts) < 2:
    raise ValueError("The least populated class in y has only 1"
                     " member, which is too few. The minimum"
                     " number of groups for any class cannot"
                     " be less than 2.")

My fix was to override this method to remove the check. This works, and I get a better distribution of my labels between training and validation. However, one of my labels, which has only 2 examples, ended up entirely in the training set. Is that normal?

Another question: is this a good way to proceed, or is there a better way to get a stratified train_test_split in the multi-label setting?

asked Nov 19 '18 by Statistic Dean



2 Answers

As you've noticed, stratification for scikit-learn's train_test_split() does not consider the labels individually, but rather as a "label set". This does not work well at all for multi-label data because the number of unique combinations grows exponentially with the number of labels. In your example, there are 1024 different possible label combinations. You'd need at least twice that many examples (each of the 1024 combinations appearing at least twice) to perform a two-way split, and even then you'd only get one example of each combination per split.

Your split with the check disabled probably worked reasonably well because the duplicated label sets could still be stratified; for the unique label sets, however, you were simply letting scikit-learn split them randomly, which is neither useful nor effective.

In 2011, Sechidis, Tsoumakas and Vlahavas proposed an algorithm called Iterative Stratification that splits a multi-label dataset by considering each label separately, starting with the one that has the fewest positive examples and working up to the best-represented one.
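
To make the idea concrete, below is a simplified sketch of the algorithm in plain NumPy. It omits the randomized tie-breaking described in the paper, so treat it as an illustration of the idea rather than a faithful implementation:

import numpy as np

def iterative_stratification(y, ratios):
    # y:      (n_samples, n_labels) binary label matrix
    # ratios: desired split proportions, e.g. [0.5, 0.5]
    # Returns one array of sample indices per split.
    n_samples, n_labels = y.shape
    remaining = set(range(n_samples))
    # How many examples each split still "wants", overall and per label
    want_total = np.asarray(ratios, dtype=float) * n_samples
    want_label = np.outer(ratios, y.sum(axis=0))
    splits = [[] for _ in ratios]

    while remaining:
        counts = y[sorted(remaining)].sum(axis=0)
        active = np.flatnonzero(counts > 0)
        if active.size == 0:
            # Leftover examples with no positive labels go to the
            # split with the largest overall demand
            for i in sorted(remaining):
                s = int(np.argmax(want_total))
                splits[s].append(i)
                want_total[s] -= 1
            break
        # Handle the label with the fewest remaining positive examples
        label = active[np.argmin(counts[active])]
        for i in [j for j in sorted(remaining) if y[j, label]]:
            # Give the example to the split that most needs this label,
            # breaking ties by overall demand
            best = np.flatnonzero(want_label[:, label] == want_label[:, label].max())
            s = int(best[np.argmax(want_total[best])])
            splits[s].append(i)
            remaining.discard(i)
            want_label[s] -= y[i]
            want_total[s] -= 1
    return [np.array(sorted(s)) for s in splits]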

There are currently two implementations of this that you can use (a short usage sketch of the first follows the list):

  1. iterative-stratification
  2. scikit-multilearn's iterative_train_test_split()
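
Here's a quick sketch of how the first package might be used; the random toy data is made up, and MultilabelStratifiedShuffleSplit comes from the package's iterstrat.ml_stratifiers module:

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

# Hypothetical data: 100 entries, 10 binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (rng.random(size=(100, 10)) < 0.3).astype(int)

msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(msss.split(X, y))

# Per-label positive counts should be roughly proportional across splits
print(y[train_idx].sum(axis=0))
print(y[test_idx].sum(axis=0))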

Say you want a two-way split for these 3-label (L1,L2,L3) samples:

L1 L2 L3
--------
0  0  0
0  0  1
0  1  0
0  1  1
1  0  0 
1  0  1
1  1  0
1  1  1

There are 8 unique label sets, yet each label has 4 positive examples. Instead of a random split, iterative stratification would attempt to give you two splits containing a balanced number of examples from each label. An example split could look like this:

Split 1
-------
L1 L2 L3
0  0  1
0  1  0
1  0  1
1  1  0

Split 2
-------
L1 L2 L3
0  0  0
0  1  1
1  0  0
1  1  1

Notice that each label now has a nice, even balance across the splits, even though each label set remains unique.
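
A quick check confirms the claim (arrays copied from the splits above):

import numpy as np

split1 = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]])
split2 = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 0], [1, 1, 1]])

print(split1.sum(axis=0))  # [2 2 2] -> two positives per label
print(split2.sum(axis=0))  # [2 2 2]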

answered Nov 10 '22 by Steven


The simplest solution for you is to use multi-label stratification with skmultilearn. Quick example:

from skmultilearn.model_selection import iterative_train_test_split

# Returns feature and label matrices for the train and test splits
X_train, y_train, X_test, y_test = iterative_train_test_split(X, y, test_size=0.2)

Please take into consideration that iterative stratification is slow and may be quite time-consuming for big datasets.

answered Nov 10 '22 by SvGA