I have a DataFrame with columns a and b. I want to partition the data by a using a window function and then assign a unique index to each row (each value of b) within its partition:
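import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val window_filter = Window.partitionBy($"a").orderBy($"b".desc)
val result = df.withColumn("uid", row_number().over(window_filter)) // df is the DataFrame with columns a and b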
But for this use case, ordering by b is unnecessary and may be time-consuming. How can I achieve this without ordering?
row_number() without an ORDER BY, or with an ORDER BY on a constant, is non-deterministic: because rows are processed in parallel, the same rows may receive different numbers from run to run. The same can happen when the ORDER BY column has ties (several rows share the same value): the relative order of the tied rows may differ between runs, and so may the numbers they receive.
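If non-deterministic numbering is acceptable, here is a minimal sketch of two options. It assumes a local SparkSession and a small made-up DataFrame; the object name and sample data are hypothetical. The first option orders the window by a constant, which to my knowledge Spark accepts, but which still shuffles the data by a and assigns numbers in an arbitrary order. The second uses monotonically_increasing_id(), which avoids the window entirely but yields IDs that are unique across the whole DataFrame rather than consecutive per value of a.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, monotonically_increasing_id, row_number}

object RowNumberWithoutMeaningfulOrder {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-number-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with the columns a and b from the question.
    val df = Seq(("x", 10), ("x", 20), ("y", 30)).toDF("a", "b")

    // Option 1: order by a constant. Numbers are 1..n within each a,
    // but which row gets which number may change between runs.
    val constantOrder = Window.partitionBy($"a").orderBy(lit(1))
    val numbered = df.withColumn("uid", row_number().over(constantOrder))
    numbered.show()

    // Option 2: no window at all. IDs are unique and increasing,
    // but not consecutive and not restarted for each value of a.
    val globallyUnique = df.withColumn("uid", monotonically_increasing_id())
    globallyUnique.show()

    spark.stop()
  }
}
```

Neither option gives stable results across runs; if the same row must always get the same uid, ordering by a column (or combination of columns) that is unique within each partition remains the safer choice.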