
How do I get deterministic random ordering in pyspark?

Tags: pyspark

I would like to randomly order a dataframe, but in a deterministic way. I thought that the way to do this was to use orderBy with a seeded rand function. However, I found that this is non-deterministic across different machines. For example, consider the following code:

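from pyspark.sql import types as T, functions as F

# Assumes an existing SparkSession named `spark`, as in the pyspark shell.
df = spark.createDataFrame(range(10), T.IntegerType())
df = df.orderBy(F.rand(seed=123))
df.show()  # show() prints the table itself and returns None, so no print() is needed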

When I run this on my local machine, it prints

+-----+
|value|
+-----+
|    3|
|    4|
|    9|
|    7|
|    8|
|    0|
|    5|
|    6|
|    2|
|    1|
+-----+

but on an EC2 instance, it prints

+-----+
|value|
+-----+
|    9|
|    5|
|    6|
|    7|
|    0|
|    1|
|    4|
|    8|
|    3|
|    2|
+-----+

How can I get a random ordering that is deterministic, even when running on different machines?

My pyspark version is 2.4.1

EDIT: By the way, I should add that just doing df.select(F.rand(seed=123)).show() produces the same output across both machines, so this is specifically a problem with the combination of orderBy and rand.

Isaac asked Apr 02 '19 07:04


1 Answer

Thank you for the additional information from your edit! That turned out to be a pretty important clue.

Problem

I think the problem is that you are attaching a pseudorandomly generated column to a data set whose rows are not in a guaranteed order to begin with. That underlying order is not deterministic across machines, so attaching a second, deterministic source of randomness on top of it doesn't help.

You can verify this by rephrasing your orderBy call like this:

df.withColumn('order', F.rand(seed=123)).orderBy(F.col('order').asc())

If I'm right, you'll see the same random values on both machines, but they'll be attached to different rows: the order in which the random values attach to rows is random!
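To make that concrete without needing a cluster, here is a rough pure-Python sketch of the effect. It is not Spark code: `random_keys` and `seeded_shuffle` are made-up helper names, and the two "machines" are just the same ten values arriving in different orders, which is my assumption about what differs between your environments.

```python
import random

def random_keys(n, seed):
    """Return the first n values of a seeded generator (stand-in for the seeded rand column)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

def seeded_shuffle(rows, seed):
    """Pair each row, in arrival order, with the next seeded value, then sort by that value."""
    keyed = list(zip(random_keys(len(rows), seed), rows))
    return [row for _, row in sorted(keyed)]

# Hypothetical arrival orders: same values, different starting order.
machine_a = list(range(10))
machine_b = list(range(5, 10)) + list(range(5))

keys_a = random_keys(10, seed=123)
keys_b = random_keys(10, seed=123)
order_a = seeded_shuffle(machine_a, seed=123)
order_b = seeded_shuffle(machine_b, seed=123)

print(keys_a == keys_b)    # True: the seeded values themselves are identical
print(order_a == order_b)  # False: they were attached to different rows
```

The seed pins down the sequence of random values, but not which row receives which value.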

Solution

If that's true, the solution should be pretty straightforward: apply a deterministic, non-random ordering over the "real" values first, then apply a random (but still deterministic) order on top.

df.orderBy(F.col('value').asc()).withColumn('order', F.rand(seed=123)).orderBy(F.col('order').asc())

should produce the same output on both machines. My result:

+-----+-------------------+
|value|              order|
+-----+-------------------+
|    4|0.13617504799810343|
|    5|0.13778573503201175|
|    6|0.15367835411103337|
|    9|0.43774287147238644|
|    0| 0.5029534413816527|
|    1| 0.5230701153994686|
|    7|  0.572063607751534|
|    8| 0.7689696831405166|
|    3|   0.82540915099773|
|    2| 0.8535692890157796|
+-----+-------------------+
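If pre-sorting ever turns out not to be enough (it may still depend on how Spark distributes the sorted rows), another option is to derive each row's sort key from the row's own value plus a seed, so arrival order cannot matter at all. Below is a pure-Python sketch of that idea, not Spark code: `shuffle_key` and `hashed_order` are hypothetical names, and SHA-256 is an arbitrary choice of hash. In PySpark the analogous move would presumably be to order by a hash of the column combined with a seed, but that is untested here.

```python
import hashlib

def shuffle_key(value, seed):
    """Derive a sort key from the seed and the row's own value, not from its position."""
    data = f"{seed}:{value}".encode("utf-8")
    return hashlib.sha256(data).hexdigest()

def hashed_order(rows, seed):
    """Sort rows by their hashed key; the result is independent of the input order."""
    return sorted(rows, key=lambda row: shuffle_key(row, seed))

# Hypothetical arrival orders: same values, different starting order.
machine_a = list(range(10))
machine_b = list(range(5, 10)) + list(range(5))

order_a = hashed_order(machine_a, seed=123)
order_b = hashed_order(machine_b, seed=123)

print(order_a == order_b)  # True: same order regardless of how the rows arrived
```

One caveat with this approach: rows with identical values get identical keys, so duplicates would need an extra tie-breaking column.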
Jesse Amano answered Jan 01 '23 09:01