
Difference between Shuffle and Random_State in train test split?

I tried both on a small sample dataset and they returned the same output. So the question is: what is the difference between the "shuffle" and "random_state" parameters of scikit-learn's train_test_split function?

Code for MWE:

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)
train_test_split(y, shuffle=False)

Out: [[0, 1, 2], [3, 4]]

train_test_split(y, random_state=0)

Out: [[0, 1, 2], [3, 4]]

EchoCache asked Nov 20 '19 13:11


People also ask

What is shuffle in train test split?

The shuffle parameter is needed to prevent non-random assignment to the train and test sets. With shuffle=True, the data is split randomly.

Does random state shuffle data?

random_state sets a seed so that results are reproducible, whereas shuffle determines whether the train and test sets are drawn from a shuffled array or not (if set to False, the first n observations in your array go into the train set and all the others go into the test set).

What should random state be split in train test?

When using scikit-learn's sklearn.model_selection.train_test_split, it is recommended to set random_state to a fixed integer (for example, random_state=42) so that the split is the same across runs.

What is the purpose of setting random_state when splitting the dataset?

The random_state parameter initializes the internal random number generator, which decides how the data is split into train and test indices.


2 Answers

Sometimes experimenting may help understand how a function works.

Say you have a DataFrame like this:

   X  Y
0  A  2
1  A  3
2  A  2
3  B  0
4  B  0

We'll go over the different things that you can do with the function train_test_split:


  • if you input train, test = train_test_split(df, test_size=2/5, shuffle=False, random_state=None), you will always end up with:
# TRAIN
   X  Y
0  A  2
1  A  3
2  A  2

# TEST
   X  Y
3  B  0
4  B  0

  • if you input train, test = train_test_split(df, test_size=2/5, shuffle=False, random_state=1) or any other int for random_state, you will get the same:
# TRAIN
   X  Y
0  A  2
1  A  3
2  A  2

# TEST
   X  Y
3  B  0
4  B  0

This comes from the fact that you decided not to shuffle your dataset, so random_state is not used by the function.


  • Now, if you do train, test = train_test_split(df, test_size=2/5, shuffle=True, random_state=None), you might get a split that looks like this:
# TRAIN
   X  Y
4  B  0
0  A  2
1  A  3

# TEST
   X  Y
2  A  2
3  B  0

Note that the entries have been shuffled. Note also that, because no seed is set, the result may differ each time you run the code.


  • Finally, if you do train, test = train_test_split(df, test_size=2/5, shuffle=True, random_state=1) or any other int for random_state, you will get two datasets with shuffled entries as well:
# TRAIN
   X  Y
4  B  0
0  A  2
3  B  0

# TEST
   X  Y
2  A  2
1  A  3

Only, this time, if you run the code again with the same random_state, the output will always remain the same. You have set a seed, which is useful for reproducibility of the results!
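
The four cases above can be checked with a short script. This is only a minimal sketch: the list `rows` stands in for the DataFrame's index (0 to 4), and the helper name `split_rows` is made up for illustration. It prints no fixed shuffled output, since the exact order depends on the seed and is not guaranteed here.

```python
from sklearn.model_selection import train_test_split

# Row labels 0..4, standing in for the index of the example DataFrame.
rows = [0, 1, 2, 3, 4]


def split_rows(**kwargs):
    """Hypothetical helper: return (train rows, test rows) as lists."""
    train, test = train_test_split(rows, test_size=2 / 5, **kwargs)
    return list(train), list(test)


# shuffle=False: rows are cut in order, and random_state has no effect.
no_shuffle_a = split_rows(shuffle=False, random_state=None)
no_shuffle_b = split_rows(shuffle=False, random_state=1)

# shuffle=True with a fixed seed: shuffled, but identical on every call.
seeded_a = split_rows(shuffle=True, random_state=1)
seeded_b = split_rows(shuffle=True, random_state=1)

# shuffle=True without a seed: the split may differ from call to call.
unseeded = split_rows(shuffle=True, random_state=None)

print(no_shuffle_a, no_shuffle_a == no_shuffle_b)
print(seeded_a, seeded_a == seeded_b)
print(unseeded)
```

Running the script a few times should show the first two lines staying constant while the last one changes between runs.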

bglbrt answered Oct 20 '22 13:10


  • random_state seeds the NumPy pseudo-random number generator used for shuffling. For reproducible code, a random_state should be specified.

  • shuffle: if True, the data is shuffled before splitting.

More details:

random_state : int, RandomState instance or None, optional (default=None)
  • If int, random_state is the seed used by the random number generator.
  • If RandomState instance, random_state is the random number generator.
  • If None, the random number generator is the RandomState instance used by np.random.
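
As a rough illustration of the three accepted forms of random_state, the sketch below passes an int, a RandomState instance and None in turn. The variable names are made up for the example, and the behaviour shown for None assumes nothing else draws from NumPy's global generator between the calls.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = list(range(10))

# int: used as the seed, so every call gives the same split.
int_a = train_test_split(data, random_state=0)
int_b = train_test_split(data, random_state=0)

# RandomState instance: used directly as the generator. A fresh instance
# seeded with 0 starts from the same state as the int seed 0.
rs_first = train_test_split(data, random_state=np.random.RandomState(0))

# None: falls back to the RandomState used by np.random, so seeding that
# global generator also makes the split repeatable.
np.random.seed(0)
global_a = train_test_split(data, random_state=None)
np.random.seed(0)
global_b = train_test_split(data, random_state=None)

print(int_a == int_b)
print(int_a == rs_first)
print(global_a == global_b)
```

Passing an int is usually the simplest choice, because the result then does not depend on any generator state outside the call.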

shuffle : boolean, optional (default=True)
  Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
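
The constraint between shuffle and stratify can be seen in a small experiment. This is a sketch with made-up toy data (`X` and `labels` are not from the question), assuming a scikit-learn version that rejects the combination with a ValueError.

```python
from sklearn.model_selection import train_test_split

# Toy data: eight samples, two balanced classes.
X = list(range(8))
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

# Stratified splitting relies on shuffling, so shuffle=False is rejected.
try:
    train_test_split(X, shuffle=False, stratify=labels)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)

# With shuffle=True (the default), stratify keeps the class proportions
# in both the train and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0
)
print(sorted(y_train), sorted(y_test))
```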

seralouk answered Oct 20 '22 13:10