How to get absolutely reproducible results with Scikit Learn?

Regarding the seeding system when running machine learning algorithms with scikit-learn, three different things are usually mentioned:

  • random.seed
  • np.random.seed
  • random_state in scikit-learn (cross-validation iterators, ML algorithms, etc.)
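
For reference, a minimal sketch showing where each of these three knobs lives (illustrative values only):

import random

import numpy as np
from sklearn.model_selection import KFold

random.seed(0)     # Python's built-in RNG (the random module)
np.random.seed(0)  # NumPy's global RNG, which scikit-learn falls back to
kf = KFold(n_splits=5, shuffle=True, random_state=0)  # per-object seed in scikit-learn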

I already have in mind this scikit-learn FAQ entry about how to fix the global seeding system, and articles which point out that this should not simply be an FAQ entry.

My ultimate question is: how can I get absolutely reproducible results when running an ML algorithm with scikit-learn?

In more detail,

  • If I only use np.random.seed and do not specify any random_state in scikit-learn, will my results be absolutely reproducible?

and one question at least for the sake of knowledge:

  • How exactly are np.random.seed and scikit-learn's random_state internally related? How does np.random.seed affect the seeding system (random_state) of scikit-learn and make it (at least hypothetically speaking) reproduce the same results?
Outcast asked Oct 10 '18




1 Answer

Defining a random seed makes sure that every time you run the algorithm, the random number generator produces the same sequence of numbers. IMHO, the result will always be the same as long as we use the same data and the same values for all other parameters.

As you have read in sklearn's FAQ, it makes no difference whether you define the seed globally via numpy.random.seed() or set the random_state parameter in all algorithms involved, provided that you use the same number in both cases.

I'll take an example from the sklearn docs to illustrate it.

import numpy as np
from sklearn.model_selection import train_test_split
# np.random.seed(42)
X, y = np.arange(10).reshape((5, 2)), range(5)

#1 running this many times, Xtr will remain [[4, 5],[0, 1],[6, 7]].
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33, random_state=42)

#2 try running this line many times; you will get a different Xtr each time
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33)

Now uncomment the np.random.seed(42) line and re-run the script from the top: the first execution of line #2 will again give Xtr == [[4, 5], [0, 1], [6, 7]]. Note that this holds per fresh run; calling line #2 repeatedly within the same session, without re-seeding, still produces different splits, because each call consumes state from the global generator.
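
A small sketch of that caveat, reusing X and y from the snippet above:

np.random.seed(42)
# first call draws from a freshly seeded global RNG: same split as random_state=42
Xtr1, Xte1, ytr1, yte1 = train_test_split(X, y, test_size=0.33)
# second call continues from the already-advanced RNG state: a different split
Xtr2, Xte2, ytr2, yte2 = train_test_split(X, y, test_size=0.33)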

Calling numpy.random.seed() with no argument sets the seed to its default (None); NumPy then tries to read entropy from /dev/urandom (or the Windows analogue) if available, or seeds from the clock otherwise (see the numpy.random.seed docs).
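
To see how random_state and the global NumPy generator are tied together, sklearn.utils.check_random_state (the helper scikit-learn uses internally to resolve random_state) is instructive. A minimal sketch, assuming the current legacy RandomState-based implementation:

import numpy as np
from sklearn.utils import check_random_state

# random_state=None resolves to NumPy's global RandomState singleton,
# i.e. the very generator that np.random.seed() re-seeds
rng = check_random_state(None)
print(rng is np.random.mtrand._rand)    # True

# an integer random_state creates a fresh, dedicated RandomState;
# re-seeding the global generator does not affect it
rng42 = check_random_state(42)
np.random.seed(0)
print(rng42 is np.random.mtrand._rand)  # False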

ipramusinto answered Oct 17 '22