What exactly does the Pandas random_state do?

I have the following code, where I use the Pandas random_state parameter:

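import pandas as pd

randomState = 123  # seed for the sampler
sampleSize = 750   # number of rows to draw
# filePath is the path to a whitespace-delimited text file (defined elsewhere)
df = pd.read_csv(filePath, sep=r"\s+")
df_s = df.sample(n=sampleSize, random_state=randomState)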

This generates a sample dataframe df_s. Every time I run the code with the same randomState, I get the same sample df_s. When I change the value from 123 to 12 the sample changes as well, so I guess that's what the random_state does.

My silly question: how does changing the number affect the sample? I read the Pandas documentation and the NumPy documentation, but could not get a clear picture.

Any straightforward explanation with an example would be much appreciated.

Newskooler asked Jul 20 '17 10:07


1 Answer

As described in the documentation of pandas.DataFrame.sample, the random_state parameter accepts either an integer (as in your case) or a numpy.random.RandomState, which is a container for a Mersenne Twister pseudorandom number generator.

If you pass it an integer, it will use this as the seed for a pseudorandom number generator. As the name suggests, the generator does not produce true randomness. Instead, it has an internal state that is initialized from the seed (for NumPy's global generator you can inspect this state by calling np.random.get_state()). When initialized with the same seed, it reproduces the same sequence of "random numbers". A different seed gives a different sequence, and therefore a different sample.
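A minimal sketch of this behavior, assuming a small made-up DataFrame in place of the CSV from the question (the column name and sizes are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the file read in the question.
df = pd.DataFrame({"value": np.arange(1000)})

# Same integer seed -> same sequence of pseudorandom numbers -> same rows.
sample_a = df.sample(n=5, random_state=123)
sample_b = df.sample(n=5, random_state=123)
# Different seed -> different sequence, so the rows will generally differ.
sample_c = df.sample(n=5, random_state=12)

same_seed_matches = sample_a.equals(sample_b)
print("seed 123 twice, identical sample:", same_seed_matches)
print("rows drawn with seed 123:", list(sample_a.index))
print("rows drawn with seed 12: ", list(sample_c.index))

# The same effect one level down: two generators built from the same seed
# emit the same numbers in the same order.
first_draw = np.random.RandomState(123).randint(0, 1000, size=5)
second_draw = np.random.RandomState(123).randint(0, 1000, size=5)
print("raw draws identical:", bool((first_draw == second_draw).all()))
```

The integer itself has no meaning beyond selecting the starting state, so 123 is not "more random" than 12; it just picks a different, equally repeatable sequence.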

If you pass it a RandomState, it will use this (already initialized/seeded) RandomState to generate pseudorandom numbers. This also lets you get reproducible results: set a fixed seed when initializing the RandomState and then pass that RandomState around. In fact, you should prefer this over setting the seed of NumPy's global RandomState. The reasoning is explained in an answer by Robert Kern and the comments on it. The idea is to have an independent stream, which prevents other parts of the program from breaking your reproducibility by changing the seed of NumPy's global RandomState.
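As a rough illustration of the independent-stream idea, the sketch below passes one seeded RandomState to two consecutive sample calls while the global generator is reseeded in between. The helper name draw_two_samples and the toy DataFrame are made up for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; any DataFrame would do.
df = pd.DataFrame({"value": np.arange(1000)})

def draw_two_samples(seed, global_seed):
    """Draw two samples from one private, seeded RandomState."""
    rng = np.random.RandomState(seed)  # independent stream owned by this code
    np.random.seed(global_seed)        # stands in for other code touching the global generator
    first = df.sample(n=5, random_state=rng)
    np.random.seed(global_seed + 1)    # the global generator is changed again
    second = df.sample(n=5, random_state=rng)
    return first, second

# Same private seed, different interference with the global generator.
run1 = draw_two_samples(seed=123, global_seed=0)
run2 = draw_two_samples(seed=123, global_seed=999)

first_reproducible = run1[0].equals(run2[0])
second_reproducible = run1[1].equals(run2[1])
print("first sample reproducible: ", first_reproducible)
print("second sample reproducible:", second_reproducible)
```

Because the RandomState object carries its own state, the second call continues where the first left off, and both runs stay identical no matter what happens to the global seed.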

jotasi answered Oct 30 '22 19:10