I tried both on a small dataset sample and they returned the same output. So the question is: what is the difference between the shuffle and random_state parameters in scikit-learn's train_test_split?
Code for MWE:
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)
train_test_split(y, shuffle=False)
Out: [[0, 1, 2], [3, 4]]
train_test_split(y, random_state=0)
Out: [[0, 1, 2], [3, 4]]
The shuffle parameter is needed to prevent non-random assignment to the train and test sets. With shuffle=True you split the data randomly.
random_state sets a seed for reproducibility of the results, whereas shuffle controls whether the train and test sets are made from a shuffled array or not (if set to False, the first n observations in your array go into the train set and all the others into the test set).
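To see the difference concretely, here is a minimal sketch (not from the original post) that splits a plain array once with shuffle=False and once with shuffle=True and a seed:

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)  # [0, 1, ..., 9]

# shuffle=False: the leading rows, in their original order, go to train,
# the trailing rows go to test
train, test = train_test_split(data, shuffle=False)
print(train, test)

# shuffle=True with a seed: rows are permuted before splitting,
# but the permutation is reproducible across runs
train, test = train_test_split(data, shuffle=True, random_state=0)
print(train, test)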
Whenever you use a scikit-learn function such as sklearn.model_selection.train_test_split, it is recommended to set the random_state parameter (for example random_state=42) to produce the same results across different runs.
The random_state parameter is used to initialize the internal random number generator, which in your case decides how the data is split into train and test indices.
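As a quick check (a sketch, not part of the answer above), two calls with the same random_state return identical splits, while leaving it as None may not:

from sklearn.model_selection import train_test_split

X = list(range(10))

# Same seed -> the two calls produce identical splits
a_train, a_test = train_test_split(X, random_state=42)
b_train, b_test = train_test_split(X, random_state=42)
assert a_train == b_train and a_test == b_test

# No seed -> each call may produce a different split
c_train, c_test = train_test_split(X)
d_train, d_test = train_test_split(X)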
Sometimes experimenting may help understand how a function works.
Say if you have a DataFrame of the sort:
X Y
0 A 2
1 A 3
2 A 2
3 B 0
4 B 0
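(If you want to reproduce the steps below, the DataFrame can be built like this; the construction itself is my own sketch, assuming pandas is available:)

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"X": ["A", "A", "A", "B", "B"],
                   "Y": [2, 3, 2, 0, 0]})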
We'll go over the different things that you can do with the function train_test_split:

If you run train, test = train_test_split(df, test_size=2/5, shuffle=False, random_state=None), you will always end up with:
# TRAIN
X Y
0 A 2
1 A 3
2 A 2
# TEST
X Y
3 B 0
4 B 0
If you run train, test = train_test_split(df, test_size=2/5, shuffle=False, random_state=1) (or any other int for random_state), you will get the same:
# TRAIN
X Y
0 A 2
1 A 3
2 A 2
# TEST
X Y
3 B 0
4 B 0
This comes from the fact that you decided not to shuffle your dataset, so random_state is not used by the function.
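You can verify that claim directly; with shuffle=False, any two values of random_state give the same split (a sketch, assuming the df built above):

from sklearn.model_selection import train_test_split

split_a = train_test_split(df, test_size=2/5, shuffle=False, random_state=1)
split_b = train_test_split(df, test_size=2/5, shuffle=False, random_state=999)

# Identical because the rows were never shuffled
assert split_a[0].equals(split_b[0]) and split_a[1].equals(split_b[1])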
If you run train, test = train_test_split(df, test_size=2/5, shuffle=True, random_state=None), you will get a dataset that looks like this:
# TRAIN
X Y
4 B 0
0 A 2
1 A 3
# TEST
X Y
2 A 2
3 B 0
Note that entries have been shuffled. But note as well that if you run your code again, results might differ.
If you run train, test = train_test_split(df, test_size=2/5, shuffle=True, random_state=1) (or any other int for random_state), you will get two datasets with shuffled entries as well:
# TRAIN
X Y
4 B 0
0 A 2
3 B 0
# TEST
X Y
2 A 2
1 A 3
Only this time, if you run the code again with the same random_state, the output will always remain the same. You have set a seed, which is useful for reproducibility of the results!
random_state controls NumPy's pseudo-random number generator. For reproducibility of the code, a random_state should be specified.
shuffle: if True, the data is shuffled before splitting.
More details:
random_state : int, RandomState instance or None, optional (default=None) If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
shuffle : boolean, optional (default=True) Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
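A practical consequence of that last sentence: stratified splitting needs shuffling, so stratify can only be combined with shuffle=True. A small sketch (the DataFrame mirrors the example above):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"X": ["A", "A", "A", "B", "B"],
                   "Y": [2, 3, 2, 0, 0]})

# stratify keeps the class proportions of column X in both splits
train, test = train_test_split(df, test_size=2/5, stratify=df["X"], random_state=1)

# stratify combined with shuffle=False raises a ValueError
try:
    train_test_split(df, test_size=2/5, shuffle=False, stratify=df["X"])
except ValueError as err:
    print(err)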