PySpark - set random seed for reproducible values

I have a PySpark dataframe to which I want to add random values, and I need repeated runs to produce the same output. I've tried setting numpy.random.seed and random.seed, but each execution of the code below still generates a different sequence of random values.

    +---+---+---+---+---+
    |  7| 15| 19| 21| 27|
    +---+---+---+---+---+
    |  0|  1|  0|  0|  0|
    |  0|  0|  0|  1|  0|
    |  0|  0|  0|  1|  0|
    |  2|  0|  0|  0|  0|
    |  4|  0|  3|  0|  0|
    |  5|  0| 25|  0|  0|
    |  6|  2|  0|  0|  0|
    +---+---+---+---+---+

Here's my current implementation:

import random
import numpy as np

# set seed (note: this only seeds the driver process)
random.seed(1234)
np.random.seed(1234)

# create dataframe (sc is the SparkContext, e.g. from a Spark shell)
df = sc.parallelize([
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [2, 0, 0, 0, 0],
    [0, 3, 0, 0, 0],
    [0, 25, 0, 0, 0],
    [2, 0, 0, 0, 0],
]).toDF(('7', '15', '19', '21', '27'))

# add a random value to every cell
random_df = df.select("*").rdd.map(
    lambda x, r=random: [float(r.random() + v) for v in x]).toDF(df.columns)

In my latest attempt at a solution above, I pass the reference to random into my lambda expression, but I still get different values with each execution despite setting the seed. Any thoughts or ideas on how to solve this challenge?

asked Sep 03 '17 by Brian Behe

1 Answer

from pyspark.sql.functions import col, rand
random_df = df.select(*((col(c) + rand(seed=1234)).alias(c) for c in df.columns))
answered Oct 14 '22 by 1.618