PySpark - set random seed for reproducible values

I have a PySpark dataframe to which I want to add random values, and I need repeated runs to produce the same output. I've tried setting numpy.random.seed and random.seed, but each execution of the code below still generates a different sequence of random values.

    +---+---+---+---+---+
    |  7| 15| 19| 21| 27|
    +---+---+---+---+---+
    |  0|  1|  0|  0|  0|
    |  0|  0|  0|  1|  0|
    |  0|  0|  0|  1|  0|
    |  2|  0|  0|  0|  0|
    |  4|  0|  3|  0|  0|
    |  5|  0| 25|  0|  0|
    |  6|  2|  0|  0|  0|
    +---+---+---+---+---+

Here's my current implementation:

import random
import numpy as np

# set seed (note: this only seeds the driver process)
random.seed(1234)
np.random.seed(1234)

# create dataframe (sc is the SparkContext, e.g. from a Spark shell)
df = sc.parallelize([
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [2, 0, 0, 0, 0],
    [0, 3, 0, 0, 0],
    [0, 25, 0, 0, 0],
    [2, 0, 0, 0, 0],
]).toDF(('7', '15', '19', '21', '27'))

# add a random value to every cell
random_df = df.select("*").rdd.map(
    lambda x, r=random: [float(r.random() + v) for v in x]).toDF(df.columns)

In my latest attempt at a solution above, I pass the reference to random into my lambda expression, but I still get different values with each execution despite setting the seed. Any thoughts or ideas on how to solve this challenge?

asked Sep 03 '17 by Brian Behe

1 Answer

from pyspark.sql.functions import col, rand
random_df = df.select(*((col(c) + rand(seed=1234)).alias(c) for c in df.columns))
answered Oct 14 '22 by 1.618