I want to add a row-number column to the dataframe below while keeping the original order.
The existing dataframe:
+---+
|val|
+---+
|1.0|
|0.0|
|0.0|
|1.0|
|0.0|
+---+
My expected output:
+---+---+
|idx|val|
+---+---+
|  1|1.0|
|  2|0.0|
|  3|0.0|
|  4|1.0|
|  5|0.0|
+---+---+
I have tried several approaches, such as the code below:
from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

w = Window().orderBy(lit('A'))
df = df.withColumn("row_num", row_number().over(w))
I also tried variants like Window.partitionBy("xxx").orderBy("yyy"). But these only group by a column and assign indexes within each group, which changes the order of my dataframe.
Can we just add one column without changing the order?
There is no inherent row order in Apache Spark. It is a distributed system: data is divided into smaller chunks called partitions, each operation is applied to those partitions in parallel, and the assignment of rows to partitions is not deterministic. Order is only preserved when you specify it explicitly in an orderBy() clause, so if you need a stable order you must have (or create) a column that defines it.