"Stratify" parameter from sklearn's train_test_split not working correctly?

Tags:

python

python-2.7

scikit-learn

I have a problem with the stratify parameter in the train_test_split() function of scikit-learn. This is a dummy example that reproduces the problem, which appears randomly on my data:

from sklearn.model_selection import train_test_split
a = [1, 0, 0, 0, 0, 0, 0, 1]
train_test_split(a, stratify=a, random_state=42)

which returns:

[[1, 0, 0, 0, 0, 1], [0, 0]]

Shouldn't it also select a "1" for the test subset? From how I expect train_test_split() with stratify to work, it should return something like:

[[1, 0, 0, 0, 0, 0], [0, 1]]

This happens with some values of random_state, while with other values it works correctly, but I cannot search for a "right" value every time I have to analyse data.

I am using Python 2.7 and scikit-learn 0.18.

asked Oct 04 '16 15:10

Hantaa

1 Answer

This question was asked 8 months ago but I guess an answer might still help readers in the future.

When you use the stratify parameter, train_test_split relies on the StratifiedShuffleSplit class to do the split. As you can see in the documentation, StratifiedShuffleSplit does aim to preserve the percentage of samples for each class, as you expected.
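To see this directly, here is a minimal sketch that calls StratifiedShuffleSplit on the list from the question and counts the 1s that land in each test set. The values n_splits=20 and random_state=0 are my own choices for illustration, and I am assuming a recent scikit-learn release rather than 0.18.

```python
from sklearn.model_selection import StratifiedShuffleSplit

a = [1, 0, 0, 0, 0, 0, 0, 1]

# 20 independent stratified reshuffles, each holding out 25% (2 of 8 samples).
# n_splits and random_state are arbitrary illustrative values.
splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.25, random_state=0)

ones_per_test_set = []
# The first argument is only used for its length here; the labels passed
# as the second argument drive the stratification.
for train_idx, test_idx in splitter.split(a, a):
    ones_per_test_set.append(sum(a[i] for i in test_idx))

print(ones_per_test_set)
```

Each entry should be either 0 or 1, since a test set of 2 samples cannot hold a quarter of a sample.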

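The problem is that in your example 25% of the samples (2 of 8) are 1s, but the test set is too small to reflect that proportion. With the default test_size of 0.25 the test set holds only 2 samples, and 25% of 2 is half a sample, so it ends up with either zero or one 1 depending on random_state. You have two options here: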

A. Increase the size of the test set with the test_size option, which defaults to 0.25, to, say, 0.5. In this case, half of your samples become the test set, and you'll see that 25% of them (i.e. 1 in 4) are 1s.

>>> a = [1, 0, 0, 0, 0, 0, 0, 1]
>>> train_test_split(a, stratify=a, random_state=42, test_size=0.5)
[[1, 0, 0, 0], [0, 0, 1, 0]]

B. Keep test_size at its default value and increase the size of a, with the same class proportions, so that 25% of its samples amount to at least 4 elements. An a of 16 samples or more will do that for you.

>>> a = [1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
>>> train_test_split(a, stratify=a, random_state=42)
[[0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0], [0, 0, 1, 0]]
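The arithmetic behind both options can be checked without scikit-learn. This is a rough sketch in Python 3: the helper expected_minority_in_test is a hypothetical name of mine, not part of the library, and it assumes that a fractional test_size is rounded up to a whole number of samples, as recent releases do.

```python
import math


def expected_minority_in_test(labels, test_size=0.25, minority=1):
    """Return (n_test, ideal number of `minority` labels in the test set).

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    n_samples = len(labels)
    # Assumption: the fractional test size is rounded up to whole samples.
    n_test = math.ceil(test_size * n_samples)
    minority_fraction = labels.count(minority) / n_samples
    return n_test, n_test * minority_fraction


a8 = [1, 0, 0, 0, 0, 0, 0, 1]
a16 = a8 * 2

original = expected_minority_in_test(a8)                 # question's setup
option_a = expected_minority_in_test(a8, test_size=0.5)  # bigger test set
option_b = expected_minority_in_test(a16)                # bigger data set

print(original, option_a, option_b)
# (2, 0.5) (4, 1.0) (4, 1.0)
```

Only when the ideal count is a whole number can every split hit it exactly. A value like 0.5 has to be rounded one way or the other, which is why the original setup gives a different result for different values of random_state.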

Hope that helps.

answered Sep 27 '22 02:09

DanielP
