I'm creating a dataset for testing with:
import random
from sklearn.datasets import make_regression
random.seed(10)
X, y = make_regression(n_samples = 1000, n_features = 10)
X[0:2]
Could you please explain why I get a different dataset on each run? For example, two runs return
array([[-0.28058959, -0.00570283, 0.31728106, 0.52745066, 1.69651572,
-0.37038286, 0.67825801, -0.71782482, -0.29886242, 0.07891646],
[ 0.73872413, -0.27472164, -1.70298606, -0.59211593, 0.04060707,
1.39661574, -1.25656819, -0.79698442, -0.38533316, 0.65484856]])
and
array([[ 0.12493586, 1.01388974, 1.2390685 , -0.13797227, 0.60029193,
-1.39268898, -0.49804303, 1.31267837, 0.11774784, 0.56224193],
[ 0.47067323, 0.3845262 , 1.22959284, -0.02913909, -1.56481745,
-1.56479078, 2.04082295, -0.22561445, -0.37150552, 0.91750366]])
You need to pass the seed to the make_regression call as a parameter:
sklearn.datasets.make_regression(n_samples=100, n_features=100, n_informative=10,
n_targets=1, bias=0.0, effective_rank=None,
tail_strength=0.5, noise=0.0, shuffle=True,
coef=False, random_state=None)
See the API:
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
So in your case:
X, y = make_regression(n_samples = 1000, n_features = 10, random_state = 10)
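As a quick sanity check of the fix above, here is a minimal sketch that calls make_regression twice with the same integer random_state and compares the results. The variable names (X1, X2, same_X, same_y) are just illustrative and do not come from the question.

```python
import numpy as np
from sklearn.datasets import make_regression

# Two independent calls with the same integer seed
X1, y1 = make_regression(n_samples=1000, n_features=10, random_state=10)
X2, y2 = make_regression(n_samples=1000, n_features=10, random_state=10)

# Element-wise comparison of features and targets
same_X = np.array_equal(X1, X2)
same_y = np.array_equal(y1, y2)
print(same_X, same_y)
```

If the seed is picked up as described in the docs, both comparisons should come out True.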
Although setting the random_state argument in make_regression, as already suggested, resolves the issue, it would arguably be useful to clarify exactly why your own code snippet does not work as expected. The answer is that, as implied in the docs, make_regression uses the random number generator (RNG) from NumPy, and not the one from the Python random module used in your code.
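To see that the two generators are independent, here is a small sketch that does not involve scikit-learn at all. It reseeds Python's random module and checks whether NumPy's draws repeat, then does the same with np.random.seed. The names a, b, c, d are only for illustration.

```python
import random
import numpy as np

# Reseeding Python's random module does not touch NumPy's global RNG
random.seed(10)
a = np.random.rand(3)
random.seed(10)
b = np.random.rand(3)

# Reseeding NumPy's global RNG does reset NumPy's draws
np.random.seed(10)
c = np.random.rand(3)
np.random.seed(10)
d = np.random.rand(3)

print(np.array_equal(a, b))  # draws after random.seed
print(np.array_equal(c, d))  # draws after np.random.seed
```

The first pair keeps advancing NumPy's stream, so a and b differ, while c and d match because the NumPy seed was actually reset in between.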
So, changing your code snippet only slightly to
import numpy as np # change 1
from sklearn.datasets import make_regression
np.random.seed(10) # change 2
X, y = make_regression(n_samples = 1000, n_features = 10) # no random_state set here
X[0:2]
always results in the same dataset:
array([[-1.32553507, -1.34894938, -0.82160306, 0.03538905, -0.68611315,
-0.74469132, 1.37391771, 0.98675482, -0.90921643, -1.57943748],
[ 1.13660812, 0.52367005, 0.05090828, -0.47210149, -0.98592548,
-0.69677968, 0.31752274, -0.0771912 , 2.17548753, 0.75189637]])
which is actually identical to the one obtained by setting random_state=10 in make_regression:
X, y = make_regression(n_samples = 1000, n_features = 10, random_state=10)
X[0:2]
# result:
array([[-1.32553507, -1.34894938, -0.82160306, 0.03538905, -0.68611315,
-0.74469132, 1.37391771, 0.98675482, -0.90921643, -1.57943748],
[ 1.13660812, 0.52367005, 0.05090828, -0.47210149, -0.98592548,
-0.69677968, 0.31752274, -0.0771912 , 2.17548753, 0.75189637]])
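The equality of the two results can also be checked programmatically rather than by eye. The sketch below compares three ways of seeding: the global NumPy seed, an integer random_state, and a dedicated RandomState instance (the third form listed in the API description). It assumes, as the docs quoted above suggest, that an integer seed is turned into a RandomState internally. The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression

# Option A: seed NumPy's global RNG and leave random_state unset
np.random.seed(10)
X_global, _ = make_regression(n_samples=1000, n_features=10)

# Option B: pass the integer seed directly
X_seed, _ = make_regression(n_samples=1000, n_features=10, random_state=10)

# Option C: pass a dedicated RandomState instance
rng = np.random.RandomState(10)
X_rng, _ = make_regression(n_samples=1000, n_features=10, random_state=rng)

global_matches_seed = np.array_equal(X_global, X_seed)
seed_matches_rng = np.array_equal(X_seed, X_rng)
print(global_matches_seed, seed_matches_rng)
```

One reason to prefer options B or C in practice is that they leave the global NumPy state alone, so other code drawing from np.random is not affected by the seeding.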
For more on RNGs, you may find my own answer in "Are random seeds compatible between systems?" useful.