I tried both on a small dataset sample and they returned the same output. So the question is: what is the difference between the shuffle and random_state parameters in scikit-learn's train_test_split?
Code for MWE:
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(10).reshape((5, 2)), range(5)
train_test_split(y, shuffle=False)
Out: [[0, 1, 2], [3, 4]]
train_test_split(y, random_state=0)
Out: [[0, 1, 2], [3, 4]]
The shuffle parameter is needed to prevent non-random assignment to the train and test sets. With shuffle=True you split the data randomly.
random_state sets a seed for reproducibility of the results, whereas shuffle controls whether the train and test sets are made from a shuffled array or not (if set to False, the first n observations in your array go into the train set and all the others into the test set).
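To see the difference concretely, here is a minimal sketch (not from the original post) that splits a plain array once with shuffle=False and once with shuffle=True and a seed:

import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)  # [0, 1, ..., 9]

# shuffle=False: the leading rows, in their original order, go to train,
# the trailing rows go to test
train, test = train_test_split(data, shuffle=False)
print(train, test)

# shuffle=True with a seed: rows are permuted before splitting,
# but the permutation is reproducible across runs
train, test = train_test_split(data, shuffle=True, random_state=0)
print(train, test)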
Whenever you use a scikit-learn function such as sklearn.model_selection.train_test_split, it is recommended to set the random_state parameter (for example random_state=42) to produce the same results across different runs.
The random_state parameter is used to initialize the internal random number generator, which in your case decides how the data is split into train and test indices.
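As a quick check (a sketch, not part of the answer above), two calls with the same random_state return identical splits, while leaving it as None may not:

from sklearn.model_selection import train_test_split

X = list(range(10))

# Same seed -> the two calls produce identical splits
a_train, a_test = train_test_split(X, random_state=42)
b_train, b_test = train_test_split(X, random_state=42)
assert a_train == b_train and a_test == b_test

# No seed -> each call may produce a different split
c_train, c_test = train_test_split(X)
d_train, d_test = train_test_split(X)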
Sometimes experimenting may help understand how a function works.
Say if you have a DataFrame of the sort:
X Y
0 A 2
1 A 3
2 A 2
3 B 0
4 B 0
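(If you want to reproduce the steps below, the DataFrame can be built like this; the construction itself is my own sketch, assuming pandas is available:)

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"X": ["A", "A", "A", "B", "B"],
                   "Y": [2, 3, 2, 0, 0]})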
We'll go over the different things that you can do with the function train_test_split:

If you run train, test = train_test_split(df, test_size=2/5, shuffle=False, random_state=None), you will always end up with:
# TRAIN
X Y
0 A 2
1 A 3
2 A 2
# TEST
X Y
3 B 0
4 B 0
If you run train, test = train_test_split(df, test_size=2/5, shuffle=False, random_state=1) (or any other int for random_state), you will get the same:
# TRAIN
X Y
0 A 2
1 A 3
2 A 2
# TEST
X Y
3 B 0
4 B 0
This comes from the fact that you decided not to shuffle your dataset, so random_state is not used by the function.
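You can verify that claim directly; with shuffle=False, any two values of random_state give the same split (a sketch, assuming the df built above):

from sklearn.model_selection import train_test_split

split_a = train_test_split(df, test_size=2/5, shuffle=False, random_state=1)
split_b = train_test_split(df, test_size=2/5, shuffle=False, random_state=999)

# Identical because the rows were never shuffled
assert split_a[0].equals(split_b[0]) and split_a[1].equals(split_b[1])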
If you run train, test = train_test_split(df, test_size=2/5, shuffle=True, random_state=None), you will get a dataset that looks like this:
# TRAIN
X Y
4 B 0
0 A 2
1 A 3
# TEST
X Y
2 A 2
3 B 0
Note that entries have been shuffled. But note as well that if you run your code again, results might differ.
If you run train, test = train_test_split(df, test_size=2/5, shuffle=True, random_state=1) (or any other int for random_state), you will get two datasets with shuffled entries as well:
# TRAIN
X Y
4 B 0
0 A 2
3 B 0
# TEST
X Y
2 A 2
1 A 3
Only this time, if you run the code again with the same random_state, the output will always remain the same. You have set a seed, which is useful for reproducibility of the results!
random_state controls NumPy's pseudo-random number generator. For reproducibility of the code, a random_state should be specified.
shuffle: if True, the data is shuffled before splitting.
More details:
random_state : int, RandomState instance or None, optional (default=None) If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
shuffle : boolean, optional (default=True) Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
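A practical consequence of that last sentence: stratified splitting needs shuffling, so stratify can only be combined with shuffle=True. A small sketch (the DataFrame mirrors the example above):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"X": ["A", "A", "A", "B", "B"],
                   "Y": [2, 3, 2, 0, 0]})

# stratify keeps the class proportions of column X in both splits
train, test = train_test_split(df, test_size=2/5, stratify=df["X"], random_state=1)

# stratify combined with shuffle=False raises a ValueError
try:
    train_test_split(df, test_size=2/5, shuffle=False, stratify=df["X"])
except ValueError as err:
    print(err)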