
sklearn random state not random

I have been playing around with the random_state parameter of StratifiedKFold in sklearn, but it does not seem to be random. I believe that setting random_state=5 should give me a different test set than setting random_state=4, but this does not seem to be the case. I have created some crude reproducible code below. First I load my data:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target

Then I set random_state=5 and store the test indices of the last fold:

skf=StratifiedKFold(n_splits=5,random_state=5)
for (train, test) in skf.split(X,y): full_test_1=test
full_test_1

array([ 40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  90,  91,  92,
        93,  94,  95,  96,  97,  98,  99, 140, 141, 142, 143, 144, 145,
       146, 147, 148, 149])

Doing the same procedure for random_state=4:

skf=StratifiedKFold(n_splits=5,random_state=4)
for (train, test) in skf.split(X,y): full_test_2=test
full_test_2

array([ 40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  90,  91,  92,
        93,  94,  95,  96,  97,  98,  99, 140, 141, 142, 143, 144, 145,
       146, 147, 148, 149])

I can then check that they are equal:

np.array_equal(full_test_1,full_test_2)
True

I do not think that the two random states should be returning the same numbers. Is there a flaw in my logic or code?

Bobe Kryant asked May 17 '17 15:05




1 Answer

From the linked docs:

random_state : None, int or RandomState

When shuffle=True, pseudo-random number generator state used for shuffling. If None, use default numpy RNG for shuffling.

You aren't setting shuffle=True in your call to StratifiedKFold, so random_state won't do anything.
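A minimal sketch of the fix, reusing the iris data from the question: pass shuffle=True so that random_state actually seeds the shuffling. The helper name last_test_fold is made up for illustration and is not part of scikit-learn. As far as I know, recent scikit-learn releases raise an error when random_state is set while shuffle is left at False instead of silently ignoring it, so the question's original calls may not run unchanged on a current install.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
X, y = iris.data, iris.target


def last_test_fold(seed):
    """Hypothetical helper: test indices of the final fold of a shuffled 5-fold split."""
    # shuffle=True is what makes random_state take effect
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    test = None
    for _, test in skf.split(X, y):
        pass  # keep only the last fold, as in the question
    return test


fold_seed_5 = last_test_fold(5)
fold_seed_4 = last_test_fold(4)
fold_seed_5_again = last_test_fold(5)

# The same seed reproduces the same fold
same_seed_equal = np.array_equal(fold_seed_5, fold_seed_5_again)
# Different seeds are expected to give different folds
different_seed_equal = np.array_equal(fold_seed_5, fold_seed_4)

print(same_seed_equal)
print(different_seed_equal)
```

Without shuffling, the folds are taken in order within each class, which is why both runs in the question returned the same contiguous blocks of indices (40-49, 90-99, 140-149).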

Personman answered Nov 03 '22 03:11
