I want to implement a machine learning algorithm in scikit-learn, but I don't understand what the <code>random_state</code> parameter does. Why should I use it? I also don't understand what a pseudo-random number is.


Random state (Pseudo-random number) in Scikit learn

6 Answers

train_test_split splits arrays or matrices into random train and test subsets. That means that every time you run it without specifying random_state, you will get a different result. This is expected behavior. For example:

Run 1:

>>> a, b = np.arange(10).reshape((5, 2)), range(5)
>>> train_test_split(a, b)
[array([[6, 7],
        [8, 9],
        [4, 5]]),
 array([[2, 3],
        [0, 1]]), [3, 4, 2], [1, 0]]

Run 2:

>>> train_test_split(a, b)
[array([[8, 9],
        [4, 5],
        [0, 1]]),
 array([[6, 7],
        [2, 3]]), [4, 2, 0], [3, 1]]

It changes. On the other hand, if you use random_state=some_number, then you can guarantee that the output of Run 1 will be equal to the output of Run 2, i.e. your split will always be the same. It doesn't matter what the actual random_state number is: 42, 0, 21, ... The important thing is that every time you use 42, you will get the same output the first time you make the split. This is useful if you want reproducible results, for example in documentation, so that everybody sees the same numbers when they run the examples. In practice, I would say you should set random_state to some fixed number while you test things, but then remove it in production if you really need a random (and not a fixed) split.
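A minimal sketch of the point above, assuming a recent scikit-learn in which train_test_split is imported from sklearn.model_selection. The seed 42 and the toy arrays are arbitrary choices for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

a, b = np.arange(10).reshape((5, 2)), list(range(5))

# Two calls with the same seed shuffle the rows identically.
first = train_test_split(a, b, random_state=42)
second = train_test_split(a, b, random_state=42)

# Compare the four returned pieces (a_train, a_test, b_train, b_test) pairwise.
same_split = all(np.array_equal(x, y) for x, y in zip(first, second))
print(same_split)  # True
```

Dropping random_state from both calls would make the comparison depend on the run, which is the behavior shown in Run 1 and Run 2.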

Regarding your second question, a pseudo-random number generator is an algorithm that produces a sequence of numbers that looks random but is completely determined by its starting value, the seed. Why the numbers are not truly random is out of the scope of this question and probably won't matter in your case; you can take a look here for more details.
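To make "pseudo-random" concrete, here is a small sketch using Python's standard-library random module rather than scikit-learn's own generator (that substitution is an assumption made to keep the example dependency-free). The seed 42 is arbitrary.

```python
import random

# Two independent generators started from the same seed.
gen_a = random.Random(42)
gen_b = random.Random(42)

seq_a = [gen_a.randint(0, 99) for _ in range(5)]
seq_b = [gen_b.randint(0, 99) for _ in range(5)]

# The values look random, yet both generators produce the same sequence,
# because the seed fully determines every number that follows.
print(seq_a == seq_b)  # True
```

This is the same mechanism random_state relies on: fixing the seed fixes the shuffle.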


answered Sep 29 '22 13:09

elyase

If you don't specify random_state in your code, then every time you run (execute) your code a new random value is generated, and the train and test datasets will contain different values each time.

However, if a fixed value is assigned, like random_state = 42, then no matter how many times you execute your code the result will be the same, i.e. the same values in the train and test datasets.

answered Sep 25 '22 13:09

umar salman

The questions of what "random state" is and why it is used have been answered nicely above. I will try to answer a different one: "Why do we so often choose 42 as the random state when training a machine learning model? Why don't we choose 12, 32 or 5? Is there a scientific explanation?"

To be specific, 42 has nothing to do with AI or ML. It is just a generic number. In machine learning it doesn't matter what the actual number is; as mentioned in the scikit-learn API docs, any integer is sufficient for the task at hand.

42 is a reference to the book The Hitchhiker's Guide to the Galaxy, where it is the answer to life, the universe and everything. It is meant as a joke and has no other significance.

References:

Wikipedia: The Hitchhiker's Guide to the Galaxy
Stack Exchange: Why the number 42 is preferred when indicating something random
Why the number 42
Quora: Why the number 42 is preferred when indicating something random
YouTube: A simple video explaining the use of random state in train-test split


answered Sep 26 '22 13:09

Achal Kagwad

If you don't set random_state in the code, then whenever you execute your code a new random value is generated, and the train and test datasets will contain different values each time.

However, if you use a particular value for random_state (random_state = 1 or any other value), the result will be the same every time, i.e. the same values in the train and test datasets. Refer to the code below:

import pandas as pd
from sklearn.model_selection import train_test_split

test_series = pd.Series(range(100))
# Same seed, different test sizes: both calls shuffle the data identically.
size30split = train_test_split(test_series, random_state=1, test_size=0.3)
size25split = train_test_split(test_series, random_state=1, test_size=0.25)
# Compare values explicitly: `in` on a Series tests the index, not the values.
train30 = set(size30split[0])
common = [element for element in size25split[0] if element in train30]
print(len(common))

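No matter how many times you run the code, the output will be 70: with the same seed, the 70 training values of the first split are all contained in the 75 training values of the second.

Now remove random_state and run the code again.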

import pandas as pd
from sklearn.model_selection import train_test_split

test_series = pd.Series(range(100))
# No seed: each call shuffles the data independently.
size30split = train_test_split(test_series, test_size=0.3)
size25split = train_test_split(test_series, test_size=0.25)
# Compare values explicitly: `in` on a Series tests the index, not the values.
train30 = set(size30split[0])
common = [element for element in size25split[0] if element in train30]
print(len(common))

Now the output can be different each time you execute the code.

answered Sep 28 '22 13:09

Rishi Bansal

The random_state value determines how the data is shuffled before it is split into training and test sets. In addition to what is explained in the other answers, remember that this value can noticeably affect the measured quality of your model (by quality I mean its score on the test data; for a regressor, score returns R²). For instance, if you train a regression model on a given dataset without specifying random_state, you may get a different test score every time, because each run trains and evaluates on a different split. One approach is to search for the random_state value that gives the highest test score and then reuse that number to reproduce the model on another occasion, such as another research experiment. Keep in mind that a score picked as the best of many splits is an optimistic estimate of how the model will do on new data. The search can be done by splitting and training in a for-loop that assigns successive integers to the random_state parameter:

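import numpy as np
from sklearn.linear_model import LarsCV
from sklearn.model_selection import train_test_split

# X and y are assumed to be the feature matrix and target defined earlier.
tr_score, ts_score = [], []
for j in range(1000):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=j, test_size=0.35)
    lr = LarsCV().fit(X_train, y_train)
    tr_score.append(lr.score(X_train, y_train))
    ts_score.append(lr.score(X_test, y_test))

# Seed whose split gave the highest test score.
J = int(np.argmax(ts_score))

# Recreate that split and refit the model on it.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=J, test_size=0.35)
M = LarsCV().fit(X_train, y_train)
y_pred = M.predict(X_test)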

answered Sep 28 '22 13:09

Arad Haselirad

If no random_state is provided, the system uses a random state that is generated internally. So when you run the program multiple times you might see different train/test data points, and the behavior will be unpredictable. If you have an issue with your model, you will not be able to recreate it, because you do not know the random number that was generated when you ran the program.

Tree classifiers, whether a decision tree or a random forest, try to build a tree using an optimal plan. Most of the time this plan will be the same, but there can be instances where the tree comes out different, and so do the predictions. When you try to debug your model, you may not be able to recreate the exact instance for which a tree was built. To avoid all this hassle, we set random_state when building a DecisionTreeClassifier or RandomForestClassifier.
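As a rough illustration of the paragraph above, the sketch below fits two random forests with the same random_state and checks that they predict identically. The synthetic dataset from make_classification, the seeds, and the forest size are all assumptions chosen for the example, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset, itself generated with a fixed seed.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Two forests built with the same seed draw the same bootstrap samples
# and the same candidate features, so they grow the same trees.
forest_a = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)
forest_b = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

identical = np.array_equal(forest_a.predict_proba(X), forest_b.predict_proba(X))
print(identical)  # True
```

Without random_state, the two forests may differ from run to run, which is what makes a problematic model hard to recreate.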

PS: You can read up on how the tree is built in DecisionTreeClassifier to understand this better.

random_state is basically used to reproduce the same behavior every time the code is run. If you do not set random_state in train_test_split, every split may give you a different set of train and test data points, which will not help you when you need to debug an issue.

From the docs:

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
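The three cases in the quoted documentation can be sketched as follows. This assumes a recent scikit-learn and NumPy; the helper two_splits and the toy array are hypothetical names introduced only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)

# int: a fresh generator is seeded on every call, so each call repeats the split.
train_1, _ = train_test_split(data, random_state=0)
train_2, _ = train_test_split(data, random_state=0)
print(np.array_equal(train_1, train_2))  # True

# RandomState instance: the generator advances between calls, so successive
# splits generally differ from one another.
def two_splits(seed):
    rng = np.random.RandomState(seed)
    first, _ = train_test_split(data, random_state=rng)
    second, _ = train_test_split(data, random_state=rng)
    return first, second

# Recreating the instance from the same seed replays the same sequence of splits.
run_a = two_splits(0)
run_b = two_splits(0)
replayed = all(np.array_equal(p, q) for p, q in zip(run_a, run_b))
print(replayed)  # True

# None (the default): NumPy's global generator is used, so the split varies
# between runs unless np.random.seed was called beforehand.
train_none, _ = train_test_split(data)
```

Passing an int is usually the simplest way to get a repeatable split; an instance is handy when several randomized steps should share one reproducible stream.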

answered Sep 29 '22 13:09

MdNazmulHossain



Tags:

python

random

scikit-learn

Elizabeth Susan Joseph
