I built a random forest model using Python's sklearn package, where I set the seed to, for example, 1234. To productionise models, we use pyspark. If I were to pass the same hyperparameters and the same seed value, i.e. 1234, would I get the same results?
Basically, do random seeds work the same across different systems?
Background: a random seed specifies the starting point of a random number sequence. It can be any number, and it often defaults to something derived from the system clock (Henkemans & Lee, 2001). Seeding is what makes results reproducible: anyone who re-runs your code with the same seed should get the exact same outputs, and reproducibility is an extremely important concept in data science and other fields. A pseudo-random number generator is deterministic: in effect, an extremely long fixed sequence of numbers, with the seed picking the point where the generator starts. Note, however, that different generators keep totally separate internal states; for example, numpy.random and the stdlib random module are independent, so numpy.random.seed() will not affect the random sequences produced by random.
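A quick sketch of those separate internal states (plain Python; the printed values match the first rows of the Numpy and native-Python tables further down):
import random
import numpy as np

np.random.seed(1234)  # seeds NumPy's global generator only
random.seed(1234)     # the stdlib generator keeps its own, independent state

# Same seed, different generators, different numbers:
print(np.random.random())  # 0.1915194503788923 -- matches the Numpy table below
print(random.random())     # 0.9664535356921388 -- matches the native Python table below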
Well, this is exactly the kind of question that is best settled with some experiments and code snippets...
Anyway, it seems that the general answer is a firm no: the same seed does not give the same numbers, not only between Python and Spark MLlib, but even between different Spark sub-modules, or between Python and Numpy...
Here is some reproducible code, run in the Databricks community cloud (where pyspark is already imported and the relevant contexts initialized):
import sys
import random
import pandas as pd
import numpy as np
from pyspark.sql.functions import rand, randn
from pyspark.mllib import random as r # avoid conflict with native Python random module
print("Spark version " + spark.version)
print("Python version %s.%s.%s" % sys.version_info[:3])
print("Numpy version " + np.version.version)
# Spark version 2.3.1
# Python version 3.5.2
# Numpy version 1.11.1
s = 1234 # RNG seed
# Spark SQL random module:
spark_df = sqlContext.range(0, 10)
spark_df = spark_df.select("id", randn(seed=s).alias("normal"), rand(seed=s).alias("uniform"))
# Python 3 random module:
random.seed(s)
x = [random.uniform(0,1) for i in range(10)] # random.random() gives the exact same results
random.seed(s)
y = [random.normalvariate(0,1) for i in range(10)]
df = pd.DataFrame({'uniform':x, 'normal':y})
# numpy random module
np.random.seed(s)
xx = np.random.uniform(size=10) # again, np.random.rand(10) gives exact same results
np.random.seed(s)
yy = np.random.randn(10)
numpy_df = pd.DataFrame({'uniform':xx, 'normal':yy})
# Spark MLlib random module
rdd_uniform = r.RandomRDDs.uniformRDD(sc, 10, seed=s).collect()
rdd_normal = r.RandomRDDs.normalRDD(sc, 10, seed=s).collect()
rdd_df = pd.DataFrame({'uniform':rdd_uniform, 'normal':rdd_normal})
And here are the results:
Native Python 3:
# df
normal uniform
0 1.430825 0.966454
1 1.803801 0.440733
2 0.321290 0.007491
3 0.599006 0.910976
4 -0.700891 0.939269
5 0.233350 0.582228
6 -0.613906 0.671563
7 -1.622382 0.083938
8 0.131975 0.766481
9 0.191054 0.236810
Numpy:
# numpy_df
normal uniform
0 0.471435 0.191519
1 -1.190976 0.622109
2 1.432707 0.437728
3 -0.312652 0.785359
4 -0.720589 0.779976
5 0.887163 0.272593
6 0.859588 0.276464
7 -0.636524 0.801872
8 0.015696 0.958139
9 -2.242685 0.875933
Spark SQL:
# spark_df.show()
+---+--------------------+-------------------+
| id| normal| uniform|
+---+--------------------+-------------------+
| 0| 0.9707422835368164| 0.9499610869333489|
| 1| 0.3641589200870126| 0.9682554532421536|
| 2|-0.22282955491417034|0.20293463923130883|
| 3|-0.00607734375219...|0.49540111648680385|
| 4| -0.603246393509015|0.04350782074761239|
| 5|-0.12066287904491797|0.09390549680302918|
| 6| 0.2899567922101867| 0.6789838400775526|
| 7| 0.5827830892516723| 0.6560703836291193|
| 8| 1.351649207673346| 0.7750229279150739|
| 9| 0.5286035772104091| 0.6075560897646175|
+---+--------------------+-------------------+
Spark MLlib:
# rdd_df
normal uniform
0 -0.957840 0.259282
1 0.742598 0.674052
2 0.225768 0.707127
3 1.109644 0.850683
4 -0.269745 0.414752
5 -0.148916 0.494394
6 0.172857 0.724337
7 -0.276485 0.252977
8 -0.963518 0.356758
9 1.366452 0.703145
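Rather than eyeballing the tables, one can also confirm the point programmatically; a minimal sketch, reusing the dataframes built above (np and pd are already in scope):
import itertools

frames = {'python': df, 'numpy': numpy_df,
          'spark_sql': spark_df.toPandas(), 'mllib': rdd_df}

# Pairwise comparison of the 'uniform' columns: every pair differs
for (n1, f1), (n2, f2) in itertools.combinations(frames.items(), 2):
    same = np.allclose(f1['uniform'].values, f2['uniform'].values)
    print("%s vs %s -> identical: %s" % (n1, n2, same))  # all False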
Of course, even if the above results were identical, this would be no guarantee that the results from, say, a Random Forest in scikit-learn would be exactly identical to those of a pyspark Random Forest...
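For completeness, here is a minimal sketch of how one could test that directly; it assumes the same environment as above (a live spark session), and the hyperparameter values are arbitrary:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from pyspark.ml.classification import RandomForestClassifier as SparkRF
from pyspark.ml.feature import VectorAssembler

s = 1234

# Small synthetic dataset, shared by both implementations
X, y = make_classification(n_samples=100, n_features=4, random_state=s)

# scikit-learn forest
sk_proba = (RandomForestClassifier(n_estimators=10, max_depth=5, random_state=s)
            .fit(X, y).predict_proba(X)[:, 1])

# pyspark.ml forest with the same hyperparameters and the same seed
cols = ["f%d" % i for i in range(4)]
pdf = pd.DataFrame(X, columns=cols)
pdf["label"] = y.astype(float)
sdf = VectorAssembler(inputCols=cols, outputCol="features") \
        .transform(spark.createDataFrame(pdf))

model = SparkRF(numTrees=10, maxDepth=5, seed=s, labelCol="label").fit(sdf)
spark_proba = np.array([row.probability[1]
                        for row in model.transform(sdf).select("probability").collect()])

print(np.allclose(sk_proba, spark_proba))  # expect False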
Despite the negative answer, I really cannot see how this affects the deployment of any ML system; i.e., if the results depend crucially on the RNG seed, then something is definitely not right...
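If in doubt, that is easy to check; a sketch in scikit-learn, with arbitrary data and seed values: retrain with a handful of different seeds and look at the spread of the cross-validated scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Cross-validated accuracy for several seeds; a large spread would mean
# the results depend crucially on the RNG -- i.e. something is wrong.
scores = [cross_val_score(RandomForestClassifier(n_estimators=100, random_state=s),
                          X, y, cv=5).mean()
          for s in (0, 1, 42, 1234, 2021)]
print(np.round(scores, 4), "spread:", round(float(np.ptp(scores)), 4))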