
"Stratify" parameter from sklearn's train_test_split not working correctly?

I have a problem with the stratify parameter in the train_test_split() function of scikit-learn. This is a dummy example with the same problem that appears randomly on my data:

from sklearn.model_selection import train_test_split
a = [1, 0, 0, 0, 0, 0, 0, 1]
train_test_split(a, stratify=a, random_state=42)

which returns:

[[1, 0, 0, 0, 0, 1], [0, 0]]

Shouldn't it also select a "1" for the test subset? Based on how I expect train_test_split() with stratify to work, it should return something like:

[[1, 0, 0, 0, 0, 0], [0, 1]]

This happens with some values of random_state, while with others it works correctly, but I cannot search for a "right" value every time I have to analyse data.

I am using Python 2.7 and scikit-learn 0.18.

Hantaa asked Oct 04 '16 15:10




1 Answer

This question was asked 8 months ago, but an answer might still help future readers.

When the stratify parameter is used, train_test_split relies on the StratifiedShuffleSplit class to do the split. As its documentation describes, StratifiedShuffleSplit aims to preserve the percentage of samples for each class, as you expected.
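If you want to look at the splitter on its own, here is a minimal sketch that drives StratifiedShuffleSplit directly on the list from the question and counts how many 1s land in the test fold for a few seeds. It assumes a reasonably recent scikit-learn; the name ones_per_seed is just illustrative, and which seeds give 0 or 1 may differ between versions, so no output is shown.

```python
from sklearn.model_selection import StratifiedShuffleSplit

a = [1, 0, 0, 0, 0, 0, 0, 1]

# For each seed, record how many 1s end up in the 2-sample test fold.
ones_per_seed = {}
for seed in range(10):
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=seed)
    # The list serves as both data and labels, as in the question.
    train_idx, test_idx = next(splitter.split(a, a))
    ones_per_seed[seed] = sum(a[i] for i in test_idx)

print(ones_per_seed)
```

With only two test samples, each count can only be 0 or 1, which is the behaviour described in the question.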

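The problem is that, in your example, 25% of the samples (2 of 8) are 1s, but the test set is too small to reflect this proportion. With the default test_size of 0.25 the test set holds only 2 samples, and 25% of 2 is half a sample, so the split cannot match the proportion exactly and, depending on the random state, the test set may contain no 1 at all. You have two options here: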

A. Increase the size of the test set by raising test_size from its default of 0.25 to, say, 0.5. In this case, half of your samples become the test set, and you'll see that 25% of them (i.e. 1 in 4) are 1s.

>>> a = [1, 0, 0, 0, 0, 0, 0, 1]
>>> train_test_split(a, stratify=a, random_state=42, test_size=0.5)
[[1, 0, 0, 0], [0, 0, 1, 0]]

B. Keep test_size at its default value and enlarge your set a so that the test set (25% of its samples) holds at least 4 elements, which leaves room for exactly one 1. An a of 16 samples or more will do that for you.

>>> a = [1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
>>> train_test_split(a, stratify=a, random_state=42)
[[0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0], [0, 0, 1, 0]]
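The arithmetic behind both options can be checked with a rough sketch in plain Python. The helper expected_test_counts is hypothetical (not part of scikit-learn), and it assumes the test set size is the ceiling of test_size times the number of samples. It returns the ideal, possibly fractional, number of test samples per class:

```python
from collections import Counter
from math import ceil


def expected_test_counts(labels, test_size=0.25):
    """Ideal (possibly fractional) number of test samples per class."""
    n_test = ceil(test_size * len(labels))  # assumed rounding rule
    counts = Counter(labels)
    return {cls: n_test * k / len(labels) for cls, k in counts.items()}


question = [1, 0, 0, 0, 0, 0, 0, 1]

print(expected_test_counts(question))       # {1: 0.5, 0: 1.5}
print(expected_test_counts(question, 0.5))  # {1: 1.0, 0: 3.0}  (option A)
print(expected_test_counts(question * 2))   # {1: 1.0, 0: 3.0}  (option B)
```

Only the original setup asks for a fractional number of 1s in the test set, which is why the result there depends on random_state.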

Hope that helps.

DanielP answered Sep 27 '22 02:09
