Are random seeds compatible between systems?

I built a random forest model using Python's sklearn package, with the seed set to, for example, 1234. To productionise models, we use pyspark. If I pass the same hyperparameters and the same seed value, i.e. 1234, will I get the same results?

Basically, do random seed numbers work between different systems?

Auren Ferguson asked Sep 12 '18 11:09



1 Answer

Well, this is exactly the kind of question that could really do with some experiments & code snippets provided...

Anyway, it seems that the general answer is a firm no: the same seed gives different numbers not only between Python and Spark MLlib, but even between Spark sub-modules, or between Python & NumPy...

Here is some reproducible code, run in the Databricks community cloud (where pyspark is already imported & the relevant contexts initialized):

import sys

import random
import pandas as pd
import numpy as np
from pyspark.sql.functions import rand, randn
from pyspark.mllib import random as r  # avoid conflict with native Python random module

print("Spark version " + spark.version)
print("Python version %s.%s.%s" % sys.version_info[:3])
print("Numpy version " + np.version.version)

# Spark version 2.3.1 
# Python version 3.5.2 
# Numpy version 1.11.1

s = 1234 # RNG seed


# Spark SQL random module:
spark_df = sqlContext.range(0, 10)
spark_df = spark_df.select("id", randn(seed=s).alias("normal"), rand(seed=s).alias("uniform"))


# Python 3 random module:
random.seed(s)
x = [random.uniform(0,1) for i in range(10)] # random.random() gives exact same results

random.seed(s)
y = [random.normalvariate(0,1) for i in range(10)]

df = pd.DataFrame({'uniform':x, 'normal':y})


# numpy random module
np.random.seed(s)
xx = np.random.uniform(size=10)  # again, np.random.rand(10) gives exact same results

np.random.seed(s)
yy = np.random.randn(10)

numpy_df = pd.DataFrame({'uniform':xx, 'normal':yy})


# Spark MLlib random module
rdd_uniform = r.RandomRDDs.uniformRDD(sc, 10, seed=s).collect()
rdd_normal = r.RandomRDDs.normalRDD(sc, 10, seed=s).collect()

rdd_df = pd.DataFrame({'uniform':rdd_uniform, 'normal':rdd_normal})

And here are the results:

Native Python 3:

# df

     normal  uniform
0  1.430825 0.966454
1  1.803801 0.440733 
2  0.321290 0.007491 
3  0.599006 0.910976 
4 -0.700891 0.939269 
5  0.233350 0.582228
6 -0.613906 0.671563
7 -1.622382 0.083938
8  0.131975 0.766481
9  0.191054 0.236810

Numpy:

# numpy_df

     normal  uniform
0  0.471435 0.191519
1 -1.190976 0.622109 
2  1.432707 0.437728
3 -0.312652 0.785359
4 -0.720589 0.779976
5  0.887163 0.272593
6  0.859588 0.276464 
7 -0.636524 0.801872 
8  0.015696 0.958139
9 -2.242685 0.875933

Spark SQL:

# spark_df.show()

+---+--------------------+-------------------+ 
| id|              normal|            uniform|
+---+--------------------+-------------------+
|  0|  0.9707422835368164| 0.9499610869333489| 
|  1|  0.3641589200870126| 0.9682554532421536|
|  2|-0.22282955491417034|0.20293463923130883|
|  3|-0.00607734375219...|0.49540111648680385|
|  4|  -0.603246393509015|0.04350782074761239|
|  5|-0.12066287904491797|0.09390549680302918|
|  6|  0.2899567922101867| 0.6789838400775526|
|  7|  0.5827830892516723| 0.6560703836291193|
|  8|   1.351649207673346| 0.7750229279150739|
|  9|  0.5286035772104091| 0.6075560897646175|
+---+--------------------+-------------------+

Spark MLlib:

# rdd_df

     normal  uniform 
0 -0.957840 0.259282 
1  0.742598 0.674052 
2  0.225768 0.707127 
3  1.109644 0.850683 
4 -0.269745 0.414752 
5 -0.148916 0.494394 
6  0.172857 0.724337
7 -0.276485 0.252977
8 -0.963518 0.356758
9  1.366452 0.703145
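
A minimal sketch of why the tables above disagree: a seed only sets the initial state of one particular generator algorithm, so the same integer fed to a different algorithm (or a different seeding routine) starts a different sequence. The `ToyLCG` class below is a hypothetical stand-in for "some other library's generator", not what NumPy or Spark actually use; only Python's `random.Random` is a real API here.

```python
import random


class ToyLCG:
    """Hypothetical linear congruential generator, for illustration only."""

    def __init__(self, seed):
        self.state = seed

    def random(self):
        # Multiplier/increment are the classic "Numerical Recipes" LCG constants
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return self.state / 2**32


seed = 1234
mt_a = random.Random(seed)   # Mersenne Twister, as used by Python's random module
mt_b = random.Random(seed)   # second, independent instance with the same seed
lcg = ToyLCG(seed)           # different algorithm, same seed

mt_first = [mt_a.random() for _ in range(3)]
mt_again = [mt_b.random() for _ in range(3)]
lcg_first = [lcg.random() for _ in range(3)]

print(mt_first == mt_again)   # True: same algorithm + same seed is reproducible
print(mt_first == lcg_first)  # False: same seed, different algorithm
```

So a seed is reproducible within one generator (and, strictly, one version of it), but it carries no meaning outside that generator.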

Of course, even if the above results were identical, this would be no guarantee that results from, say, Random Forest in scikit-learn, would be exactly identical to the results of pyspark Random Forest...
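
To illustrate that last point with a sketch: even when two implementations share the same generator and the same seed, they diverge as soon as they consume the random draws in a different order. The two functions below are hypothetical, heavily simplified stand-ins for one tree-building step (bootstrap the rows, pick candidate features); they are not how scikit-learn or Spark actually implement Random Forest.

```python
import random

N_ROWS, N_FEATURES = 20, 4   # made-up sizes for the illustration


def rows_then_features(seed):
    """Draw the bootstrap sample first, then the candidate features."""
    rng = random.Random(seed)
    rows = [rng.randrange(N_ROWS) for _ in range(N_ROWS)]
    feats = rng.sample(range(N_FEATURES), 2)
    return rows, feats


def features_then_rows(seed):
    """Same draws, same generator, but in the opposite order."""
    rng = random.Random(seed)
    feats = rng.sample(range(N_FEATURES), 2)
    rows = [rng.randrange(N_ROWS) for _ in range(N_ROWS)]
    return rows, feats


a = rows_then_features(1234)
b = features_then_rows(1234)

print(a == rows_then_features(1234))  # same function, same seed: reproducible
print(a == b)                         # same seed, different draw order: expected to differ
```

Two real libraries differ in far more than draw order (tree-splitting heuristics, binning of continuous features, parallelism), so matching the seed is not enough even in principle.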

Despite the negative answer, I really cannot see how that affects the deployment of any ML system, i.e. if the results depend crucially on the RNG, then something is definitely not right...
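
One hedged way to act on that: rather than trying to match seeds across systems, check that the metric you care about is stable across several seeds within each system. In the sketch below, `outcomes` is made-up data (per-example correct/incorrect flags), `bootstrap_accuracy` is a hypothetical stand-in for "retrain and score with this seed", and the 0.05 tolerance is arbitrary.

```python
import random
import statistics

# Made-up evaluation data: 1 = correct prediction, 0 = wrong (80% accuracy)
outcomes = [1] * 4000 + [0] * 1000


def bootstrap_accuracy(seed):
    """Accuracy on one bootstrap resample of `outcomes`, driven by `seed`."""
    rng = random.Random(seed)
    resample = rng.choices(outcomes, k=len(outcomes))
    return sum(resample) / len(resample)


scores = [bootstrap_accuracy(seed) for seed in range(10)]
spread = max(scores) - min(scores)

print("mean accuracy: %.3f" % statistics.mean(scores))
print("spread across seeds: %.3f" % spread)
print("seed-sensitive" if spread > 0.05 else "stable across seeds")
```

If the spread is small relative to the differences you care about, the exact seed (and hence the exact RNG) is irrelevant for deployment; if it is large, that points at the model or the data size rather than at the seed.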

desertnaut answered Oct 08 '22 09:10