 

scikit-learn random state in splitting dataset

Can anyone tell me why we set random_state to zero when splitting a dataset into train and test sets?

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

I have also seen cases like this, where random_state is set to 1:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

What effect does this random state have in cross-validation as well?

Shelly asked Feb 12 '17 18:02


People also ask

What is the random state in train test split?

The random_state parameter of the train_test_split() function controls the shuffling applied before the split. With random_state=None, we get different train and test sets across different executions, and the shuffling is not reproducible. With random_state=0, we get the same train and test sets across different executions.

Does train_test_split split randomly?

Yes. train_test_split allocates rows of data to the output randomly. Therefore, every time you run train_test_split with the default settings, the output data will contain observations that are randomly selected from the input data. Use random_state to make the split repeatable.

What is random_state in Sklearn?

The random state is simply the seed of the random number generator used in an operation. We can specify the same seed whenever we want the same randomly generated result again.

What is test size and random state in Sklearn train_test_split?

You should provide either train_size or test_size . If neither is given, then the default share of the dataset that will be used for testing is 0.25 , or 25 percent. random_state is the object that controls randomization during splitting. It can be either an int or an instance of RandomState .


2 Answers

It doesn't matter whether random_state is 0, 1, or any other integer. What matters is that it is set to the same value on every run if you want to validate your processing over multiple runs of the code. By the way, I have seen random_state=42 used in many official scikit-learn examples as well as elsewhere.

random_state, as the name suggests, is used to initialize the internal random number generator, which in your case decides how the data is split into train and test indices. In the documentation, it is stated that:

If random_state is None or np.random, then a randomly-initialized RandomState object is returned.

If random_state is an integer, then it is used to seed a new RandomState object.

If random_state is a RandomState object, then it is passed through.
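The three cases quoted above can be sketched with scikit-learn's check_random_state helper from sklearn.utils. This is only an illustration, assuming a reasonably recent scikit-learn and NumPy; the variable names are made up for this example.

```python
import numpy as np
from sklearn.utils import check_random_state

# An integer seeds a new RandomState, so two generators built from
# the same integer produce identical streams of random numbers.
rng_a = check_random_state(0)
rng_b = check_random_state(0)
same_stream = np.array_equal(rng_a.permutation(10), rng_b.permutation(10))

# An existing RandomState instance is passed through unchanged
# (the very same object comes back).
existing = np.random.RandomState(1)
passed_through = check_random_state(existing) is existing

# None gives a RandomState that is not seeded by this call,
# so results are not repeatable across separate runs.
unseeded = check_random_state(None)
is_random_state = isinstance(unseeded, np.random.RandomState)

print(same_stream, passed_through, is_random_state)
```
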

This lets you check and validate the results when running the code multiple times. Setting random_state to a fixed value guarantees that the same sequence of random numbers is generated each time you run the code. Unless some other source of randomness is present in the process, the results will be the same on every run. This helps in verifying the output.
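A minimal sketch of that behaviour, using a small made-up dataset (the X and y arrays and the split helper are assumptions for illustration, not part of the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with 2 features each, labelled 0..19.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

def split(seed):
    """Split the toy data with the given random_state."""
    return train_test_split(X, y, test_size=0.30, random_state=seed)

_, _, _, y_test_a = split(0)
_, _, _, y_test_b = split(0)   # same seed as above
_, _, _, y_test_c = split(1)   # different seed

# The same seed reproduces the same test set.
same_seed_matches = np.array_equal(y_test_a, y_test_b)

# A different seed will usually, though not necessarily, pick other rows.
different_seed_matches = np.array_equal(y_test_a, y_test_c)

print(same_seed_matches, different_seed_matches)
```

Neither 0 nor 1 is a better split here; the value only determines which reproducible split you get.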

Vivek Kumar answered Sep 21 '22 06:09


If you don't specify random_state in the code, a new random seed is used every time you execute it, so the train and test datasets will contain different values on each run.

However, if you use a particular value for random_state (random_state=1 or any other value), the result will be the same every time, i.e., the train and test datasets will contain the same values.

Rishi Bansal answered Sep 20 '22 06:09