I built a random forest model using Python's sklearn package, where I set the seed to, for example, 1234. To productionise models, we use pyspark. If I were to pass the same hyperparameters and the same seed value, i.e. 1234, would I get the same results?
Basically, do random seeds work the same across different systems?
Background: a random seed specifies the starting point of a random number sequence. It can be any number, and it often defaults to something derived from the system clock (Henkemans & Lee, 2001). Seeding is what makes results reproducible: anyone who re-runs your code with the same seed should get the exact same outputs, and reproducibility is an extremely important concept in data science and other fields. A pseudo-random number generator is deterministic: in effect, an extremely long fixed sequence of numbers, with the seed picking the point where the generator starts. Note, however, that different generators keep totally separate internal states; for example, numpy.random and the stdlib random module are independent, so numpy.random.seed() will not affect the random sequences produced by random.
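A quick sketch of those separate internal states (plain Python; the printed values match the first rows of the Numpy and native-Python tables further down):
import random
import numpy as np

np.random.seed(1234)  # seeds NumPy's global generator only
random.seed(1234)     # the stdlib generator keeps its own, independent state

# Same seed, different generators, different numbers:
print(np.random.random())  # 0.1915194503788923 -- matches the Numpy table below
print(random.random())     # 0.9664535356921388 -- matches the native Python table below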
Well, this is exactly the kind of question that is best settled with some experiments and code snippets...
Anyway, it seems that the general answer is a firm no: the same seed does not give the same numbers, not only between Python and Spark MLlib, but even between different Spark sub-modules, or between Python and Numpy...
Here is some reproducible code, run in the Databricks community cloud (where pyspark is already imported and the relevant contexts initialized):
import sys
import random
import pandas as pd
import numpy as np
from pyspark.sql.functions import rand, randn
from pyspark.mllib import random as r # avoid conflict with native Python random module
print("Spark version " + spark.version)
print("Python version %s.%s.%s" % sys.version_info[:3])
print("Numpy version " + np.version.version)
# Spark version 2.3.1
# Python version 3.5.2
# Numpy version 1.11.1
s = 1234 # RNG seed
# Spark SQL random module:
spark_df = sqlContext.range(0, 10)
spark_df = spark_df.select("id", randn(seed=s).alias("normal"), rand(seed=s).alias("uniform"))
# Python 3 random module:
random.seed(s)
x = [random.uniform(0,1) for i in range(10)] # random.random() gives the exact same results
random.seed(s)
y = [random.normalvariate(0,1) for i in range(10)]
df = pd.DataFrame({'uniform':x, 'normal':y})
# numpy random module
np.random.seed(s)
xx = np.random.uniform(size=10) # again, np.random.rand(10) gives exact same results
np.random.seed(s)
yy = np.random.randn(10)
numpy_df = pd.DataFrame({'uniform':xx, 'normal':yy})
# Spark MLlib random module
rdd_uniform = r.RandomRDDs.uniformRDD(sc, 10, seed=s).collect()
rdd_normal = r.RandomRDDs.normalRDD(sc, 10, seed=s).collect()
rdd_df = pd.DataFrame({'uniform':rdd_uniform, 'normal':rdd_normal})
And here are the results:
Native Python 3:
# df
normal uniform
0 1.430825 0.966454
1 1.803801 0.440733
2 0.321290 0.007491
3 0.599006 0.910976
4 -0.700891 0.939269
5 0.233350 0.582228
6 -0.613906 0.671563
7 -1.622382 0.083938
8 0.131975 0.766481
9 0.191054 0.236810
Numpy:
# numpy_df
normal uniform
0 0.471435 0.191519
1 -1.190976 0.622109
2 1.432707 0.437728
3 -0.312652 0.785359
4 -0.720589 0.779976
5 0.887163 0.272593
6 0.859588 0.276464
7 -0.636524 0.801872
8 0.015696 0.958139
9 -2.242685 0.875933
Spark SQL:
# spark_df.show()
+---+--------------------+-------------------+
| id| normal| uniform|
+---+--------------------+-------------------+
| 0| 0.9707422835368164| 0.9499610869333489|
| 1| 0.3641589200870126| 0.9682554532421536|
| 2|-0.22282955491417034|0.20293463923130883|
| 3|-0.00607734375219...|0.49540111648680385|
| 4| -0.603246393509015|0.04350782074761239|
| 5|-0.12066287904491797|0.09390549680302918|
| 6| 0.2899567922101867| 0.6789838400775526|
| 7| 0.5827830892516723| 0.6560703836291193|
| 8| 1.351649207673346| 0.7750229279150739|
| 9| 0.5286035772104091| 0.6075560897646175|
+---+--------------------+-------------------+
Spark MLlib:
# rdd_df
normal uniform
0 -0.957840 0.259282
1 0.742598 0.674052
2 0.225768 0.707127
3 1.109644 0.850683
4 -0.269745 0.414752
5 -0.148916 0.494394
6 0.172857 0.724337
7 -0.276485 0.252977
8 -0.963518 0.356758
9 1.366452 0.703145
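Rather than eyeballing the tables, one can also confirm the point programmatically; a minimal sketch, reusing the dataframes built above (np and pd are already in scope):
import itertools

frames = {'python': df, 'numpy': numpy_df,
          'spark_sql': spark_df.toPandas(), 'mllib': rdd_df}

# Pairwise comparison of the 'uniform' columns: every pair differs
for (n1, f1), (n2, f2) in itertools.combinations(frames.items(), 2):
    same = np.allclose(f1['uniform'].values, f2['uniform'].values)
    print("%s vs %s -> identical: %s" % (n1, n2, same))  # all False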
Of course, even if the above results were identical, this would be no guarantee that the results from, say, a Random Forest in scikit-learn would be exactly identical to those of a pyspark Random Forest...
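For completeness, here is a minimal sketch of how one could test that directly; it assumes the same environment as above (a live spark session), and the hyperparameter values are arbitrary:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from pyspark.ml.classification import RandomForestClassifier as SparkRF
from pyspark.ml.feature import VectorAssembler

s = 1234

# Small synthetic dataset, shared by both implementations
X, y = make_classification(n_samples=100, n_features=4, random_state=s)

# scikit-learn forest
sk_proba = (RandomForestClassifier(n_estimators=10, max_depth=5, random_state=s)
            .fit(X, y).predict_proba(X)[:, 1])

# pyspark.ml forest with the same hyperparameters and the same seed
cols = ["f%d" % i for i in range(4)]
pdf = pd.DataFrame(X, columns=cols)
pdf["label"] = y.astype(float)
sdf = VectorAssembler(inputCols=cols, outputCol="features") \
        .transform(spark.createDataFrame(pdf))

model = SparkRF(numTrees=10, maxDepth=5, seed=s, labelCol="label").fit(sdf)
spark_proba = np.array([row.probability[1]
                        for row in model.transform(sdf).select("probability").collect()])

print(np.allclose(sk_proba, spark_proba))  # expect False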
Despite the negative answer, I really cannot see how this affects the deployment of any ML system; i.e., if the results depend crucially on the RNG seed, then something is definitely not right...
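If in doubt, that is easy to check; a sketch in scikit-learn, with arbitrary data and seed values: retrain with a handful of different seeds and look at the spread of the cross-validated scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Cross-validated accuracy for several seeds; a large spread would mean
# the results depend crucially on the RNG -- i.e. something is wrong.
scores = [cross_val_score(RandomForestClassifier(n_estimators=100, random_state=s),
                          X, y, cv=5).mean()
          for s in (0, 1, 42, 1234, 2021)]
print(np.round(scores, 4), "spread:", round(float(np.ptp(scores)), 4))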