I would like to randomly order a dataframe, but in a deterministic way. I thought that the way to do this was to use orderBy with a seeded rand function. However, I found that this is non-deterministic across different machines. For example, consider the following code:
from pyspark.sql import types as T, functions as F
df = spark.createDataFrame(range(10), T.IntegerType())
df = df.orderBy(F.rand(seed=123))
df.show()
When I run this on my local machine, it prints
+-----+
|value|
+-----+
|    3|
|    4|
|    9|
|    7|
|    8|
|    0|
|    5|
|    6|
|    2|
|    1|
+-----+
but on an EC2 instance, it prints
+-----+
|value|
+-----+
|    9|
|    5|
|    6|
|    7|
|    0|
|    1|
|    4|
|    8|
|    3|
|    2|
+-----+
How can I get a random ordering that is deterministic, even when running on different machines?
My pyspark version is 2.4.1.
EDIT: By the way, I should add that just doing df.select(F.rand(seed=123)).show() produces the same output across both machines, so this is specifically a problem with the combination of orderBy and rand.
Thank you for the additional information from your edit! That turned out to be a pretty important clue.
I think the problem here is that you are attaching a pseudorandomly generated column to a data set whose rows are already in an arbitrary order. That existing order is not deterministic (it probably depends on how the data happens to be partitioned on each machine), so attaching a second, deterministic source of randomness on top of it doesn't help.
You can verify this by rephrasing your orderBy call like:
df.withColumn('order', F.rand(seed=123)).orderBy(F.col('order').asc())
If I'm right, you'll see the same random values on both machines, but they'll be attached to different rows: the order in which the random values attach to rows is random!
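To make that concrete without needing a cluster, here is a minimal pure-Python sketch of the effect. It is only an analogy, not Spark itself: `seeded_shuffle`, `machine_a`, and `machine_b` are hypothetical names, and the two "arrival orders" are made up to stand in for whatever row order each machine happens to produce.

import random

def seeded_shuffle(rows, seed=123):
    # Draw one pseudorandom number per row, in the order the rows arrive,
    # then sort by that number (a stand-in for orderBy(F.rand(seed=123))).
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]
    return [row for _, row in sorted(keyed)]

# Hypothetical arrival orders of the same ten rows on two machines.
machine_a = list(range(10))
machine_b = [5, 6, 7, 8, 9, 0, 1, 2, 3, 4]

# The stream of random numbers is identical on both "machines"...
stream_a = [random.Random(123).random() for _ in range(1)]
stream_b = [random.Random(123).random() for _ in range(1)]
print(stream_a == stream_b)

# ...but each number lands on a different row, so the final orders differ.
print(seeded_shuffle(machine_a))
print(seeded_shuffle(machine_b))

The seed fixes the sequence of random numbers, but not which row each number is paired with; that pairing is decided by the incoming row order.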
And if that's true, the solution should be pretty straightforward: apply deterministic, non-random ordering over "real" values, before applying a random (but still deterministic) order on top.
df.orderBy(F.col('value').asc()).withColumn('order', F.rand(seed=123)).orderBy(F.col('order').asc())
should produce the same output on both machines. My result:
+-----+-------------------+
|value|              order|
+-----+-------------------+
|    4|0.13617504799810343|
|    5|0.13778573503201175|
|    6|0.15367835411103337|
|    9|0.43774287147238644|
|    0| 0.5029534413816527|
|    1| 0.5230701153994686|
|    7|  0.572063607751534|
|    8| 0.7689696831405166|
|    3|   0.82540915099773|
|    2| 0.8535692890157796|
+-----+-------------------+
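A different technique from the one above, in case it is useful: derive each row's sort key from the row's own value plus a seed, so the key cannot depend on row order at all. The sketch below shows the idea in plain Python with hashlib; `hash_key` and `hashed_order` are hypothetical helper names. In PySpark the analogous key could presumably be built with F.sha2 and F.concat_ws, but that translation is an assumption and I have not tested it across machines.

import hashlib

def hash_key(value, seed=123):
    # The key depends only on the seed and the row's own value,
    # never on the row's position in the input.
    payload = f"{seed}|{value}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def hashed_order(rows, seed=123):
    # Sort rows by their hash key: pseudorandom-looking, but reproducible.
    return sorted(rows, key=lambda v: hash_key(v, seed))

# Two hypothetical arrival orders of the same rows.
order_a = hashed_order(range(10))
order_b = hashed_order([5, 6, 7, 8, 9, 0, 1, 2, 3, 4])
print(order_a == order_b)

One caveat with this approach: rows with identical values get identical keys, so duplicates would need an extra tie-breaking column to be ordered reproducibly.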