I have been playing around with the random_state parameter of StratifiedKFold in sklearn, but it does not seem to be random. I believe that setting random_state=5 should give me a different testing set than setting random_state=4, but this does not seem to be the case. I have created some crude reproducible code below. First I load my data:
import numpy as np
from sklearn.model_selection import StratifiedKFold  # n_splits and .split() belong to the model_selection API
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
Then I set random_state=5, for which I store the test indices of the last fold:
skf=StratifiedKFold(n_splits=5,random_state=5)
for (train, test) in skf.split(X,y): full_test_1=test
full_test_1
array([ 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 140, 141, 142, 143, 144, 145,
146, 147, 148, 149])
Doing the same procedure for random_state=4:
skf=StratifiedKFold(n_splits=5,random_state=4)
for (train, test) in skf.split(X,y): full_test_2=test
full_test_2
array([ 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 90, 91, 92,
93, 94, 95, 96, 97, 98, 99, 140, 141, 142, 143, 144, 145,
146, 147, 148, 149])
I can then check that they are equal:
np.array_equal(full_test_1,full_test_2)
True
I do not think that the two random states should be returning the same numbers. Is there a flaw in my logic or code?
Many students and practitioners use the number 42 as a random state because many instructors in online courses use it. The instructors set the random state or NumPy seed to 42, and learners follow the same practice without giving it much thought. To be specific, 42 has nothing to do with AI or ML.
The random_state parameter of the train_test_split() function controls the shuffling process. With random_state=None, we can get different train and test sets across different executions, so the shuffling is not reproducible. With a fixed integer such as random_state=0, we get the same train and test sets across different executions.
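To make the train_test_split() behaviour concrete, here is a minimal sketch. The toy arrays and the variable names (test_a, test_b, test_c, same_split) are made up for illustration and are not from the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, purely illustrative: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same integer seed on both calls: the shuffle, and so the split, is repeated exactly.
_, test_a, _, _ = train_test_split(X, y, test_size=3, random_state=0)
_, test_b, _, _ = train_test_split(X, y, test_size=3, random_state=0)
same_split = np.array_equal(test_a, test_b)
print(same_split)

# random_state=None: the shuffle is not pinned to a seed,
# so this test set may differ from one execution to the next.
_, test_c, _, _ = train_test_split(X, y, test_size=3, random_state=None)
print(test_c)
```

Note that train_test_split() shuffles by default, which is why random_state has a visible effect there without any extra arguments.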
The random_state parameter is used for initializing the internal random number generator, which will decide the splitting of data into train and test indices in your case.
random state: Whenever randomization is part of a Scikit-learn algorithm, a random_state parameter may be provided to control the random number generator used.
From the linked docs
random_state : None, int or RandomState
When shuffle=True, pseudo-random number generator state used for shuffling. If None, use default numpy RNG for shuffling.
You aren't setting shuffle=True in your call to StratifiedKFold, so random_state won't do anything.
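As a rough sketch of the fix, the question's experiment can be rerun with shuffle=True. The helper name last_test_fold is my own, not part of sklearn. As far as I know, recent scikit-learn versions raise an error when random_state is set while shuffle is left at False, instead of silently ignoring it, so treat that detail as an assumption and check it against your installed version.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X, y = iris.data, iris.target


def last_test_fold(seed):
    """Return the test indices of the final fold of a shuffled stratified split.

    Hypothetical helper for illustration only.
    """
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    test = None
    for _, test in skf.split(X, y):
        pass  # keep only the last fold, as in the question
    return test


full_test_1 = last_test_fold(5)
full_test_2 = last_test_fold(4)
full_test_1_again = last_test_fold(5)

# Different seeds should now produce different test folds ...
print(np.array_equal(full_test_1, full_test_2))
# ... while the same seed reproduces the same fold.
print(np.array_equal(full_test_1, full_test_1_again))
```

With shuffling enabled, the samples within each class are shuffled before being assigned to folds, so the seed finally has something to act on.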