Regarding the seeding system when running machine learning algorithms with Scikit-Learn, three different things are usually mentioned:

random.seed
np.random.seed
random_state in SkLearn (cross-validation iterators, ML algorithms, etc.)

I already have in mind this FAQ of SkLearn about how to fix the global seeding system, as well as articles which point out that this deserves more than a simple FAQ entry.

My ultimate question is: how can I get absolutely reproducible results when running an ML algorithm with SkLearn?
In more detail: if I set np.random.seed and do not specify any random_state in SkLearn, will my results be absolutely reproducible?

And one question, at least for the sake of knowledge: are np.random.seed and SkLearn's random_state internally related? How does np.random.seed affect SkLearn's seeding system (random_state) and make it (at least hypothetically speaking) reproduce the same results?

Reproducibility with random seeds

We can achieve this by setting a random seed to any given number before we build and train our model. By setting a random seed, we force the "random" initialization of the weights to be generated based upon the seed we set.
partial_fit is a handy API that can be used to perform incremental learning on mini-batches of an out-of-memory dataset. The primary purpose of warm_start is to reduce training time when fitting the same dataset with different sets of hyperparameter values.
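As a sketch of what mini-batch learning with partial_fit looks like (the synthetic data and the batch size of 20 are assumptions made for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = (X[:, 0] > 0).astype(int)  # label depends only on the first feature

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared on the first call
for start in range(0, 100, 20):  # feed the data in mini-batches of 20 rows
    batch = slice(start, start + 20)
    clf.partial_fit(X[batch], y[batch], classes=classes)

print(clf.score(X, y))
```

Each call to partial_fit performs one pass over just that batch, so the full dataset never needs to fit in memory at once.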
Defining a random seed makes sure that every time you run the algorithm, the random generator will produce the same numbers. IMHO, the result will always be the same as long as we use the same data and the same values for any other parameters.
As you have read in sklearn's FAQ, it is the same whether you define it globally via numpy.random.seed() or set the random_state parameter in all algorithms involved, provided that you use the same number in both cases.
I take an example from the sklearn docs to illustrate it.

import numpy as np
from sklearn.model_selection import train_test_split
# np.random.seed(42)
X, y = np.arange(10).reshape((5, 2)), range(5)
#1 running this many times, Xtr will remain [[4, 5], [0, 1], [6, 7]]
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33, random_state=42)
#2 try running this line many times; you will get various Xtr
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)
Now uncomment the third line and run #2 many times: Xtr will always be [[4, 5], [0, 1], [6, 7]].
If you call numpy.random.seed() with no argument, the seed is set to its default (None), and NumPy will then try to read data from /dev/urandom (or the Windows analogue) if available, or seed from the clock otherwise (docs).