I have a PySpark DataFrame to which I want to add random values in a reproducible way, so that every run produces the same output. I've tried setting numpy.random.seed and random.seed, but each execution of the code below still generates a different sequence of random values. The dataframe looks like this:
+---+---+---+---+---+
|  7| 15| 19| 21| 27|
+---+---+---+---+---+
|  0|  1|  0|  0|  0|
|  0|  0|  0|  1|  0|
|  0|  0|  0|  1|  0|
|  2|  0|  0|  0|  0|
|  0|  3|  0|  0|  0|
|  0| 25|  0|  0|  0|
|  2|  0|  0|  0|  0|
+---+---+---+---+---+
Here's my current implementation:
import random
import numpy as np

# Set the seeds.
random.seed(1234)
np.random.seed(1234)

# Create the dataframe.
df = sc.parallelize([
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [2, 0, 0, 0, 0],
    [0, 3, 0, 0, 0],
    [0, 25, 0, 0, 0],
    [2, 0, 0, 0, 0],
]).toDF(('7', '15', '19', '21', '27'))

# Add a uniform random value to every cell of every row.
random_df = df.select("*").rdd.map(
    lambda x, r=random: [float(r.random() + v) for v in x]).toDF(df.columns)
In my latest attempt above, I pass a reference to random into the lambda expression, but I still get different values on each execution despite setting the seed. Any ideas on how to make this reproducible?
The seed you set only affects the driver process. The lambda runs in the executors' Python workers, each of which has its own, unseeded random state, so the driver-side seed never reaches the code doing the drawing. Instead of shipping Python's random to the workers, use Spark's built-in rand, which takes its own seed:

from pyspark.sql.functions import col, rand

random_df = df.select(*((col(c) + rand(seed=1234)).alias(c) for c in df.columns))

With a fixed seed, rand produces the same values on every run, as long as the dataframe's partitioning stays the same.
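If you specifically need Python's random or NumPy (say, for a distribution rand doesn't cover), one workaround is to seed a fresh generator inside each partition so every run replays the same streams. The sketch below is illustrative rather than canonical: the add_noise helper, the base seed 1234, and the use of numpy.random.default_rng are my own choices, not part of the original code.

import numpy as np

def add_noise(partition_index, rows):
    # One generator per partition, seeded from the partition index,
    # so a given partition replays the same stream on every run.
    rng = np.random.default_rng(1234 + partition_index)
    for row in rows:
        yield [float(v + rng.random()) for v in row]

random_df = df.rdd.mapPartitionsWithIndex(add_noise).toDF(df.columns)

Like rand(seed=...), this is only deterministic while df keeps the same partitioning between runs; repartitioning or a different input order changes which rows land in which partition.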