
PySpark: how to add a row number to a dataframe without changing the order?

I want to add a row-number column to the dataframe below while keeping the original order.

The existing dataframe:

+---+
|val|
+---+
|1.0|
|0.0|
|0.0|
|1.0|
|0.0|
+---+

My expected output:

+---+---+
|idx|val|
+---+---+
|  1|1.0|
|  2|0.0|
|  3|0.0|
|  4|1.0|
|  5|0.0|
+---+---+

I have tried several approaches, like the code below:

from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

# Attempt 1: order the window by a constant literal
w = Window().orderBy(lit('A'))
df = df.withColumn("row_num", row_number().over(w))

# Attempt 2: partition and order by existing columns
w = Window.partitionBy("xxx").orderBy("yyy")
df = df.withColumn("row_num", row_number().over(w))

But the above code only groups rows by value and assigns an index within each group, which leaves my dataframe out of its original order.

Can we just add one column without changing the order?

asked Sep 11 '25 03:09 by Jason Wong

1 Answer

There's no such thing as inherent row order in Apache Spark. It is a distributed system where data is divided into smaller chunks called partitions, and each operation is applied to those partitions independently; the assignment of rows to partitions is effectively arbitrary. So you will not be able to preserve order unless you specify it in an orderBy() clause: if you need to keep an order, you need a column whose values define that order.
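As a sketch of one common workaround (not part of the original answer): if the dataframe's current partition order still reflects the order you want, RDD.zipWithIndex() assigns consecutive indices in that order. This assumes the dataframe was just created from an ordered source and has not been shuffled:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The example data from the question
df = spark.createDataFrame([(1.0,), (0.0,), (0.0,), (1.0,), (0.0,)], ["val"])

# zipWithIndex pairs each row with its index in the current partition order;
# rebuild a dataframe with a 1-based idx as the first column.
df_with_idx = (
    df.rdd.zipWithIndex()
      .map(lambda pair: (pair[1] + 1,) + tuple(pair[0]))
      .toDF(["idx"] + df.columns)
)

df_with_idx.show()

Note that anything that shuffles the data (joins, repartitioning, window functions) can reorder rows, so materialize the idx column first and use it in later orderBy() calls. monotonically_increasing_id() is a cheaper alternative when you only need increasing, not consecutive, IDs.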

answered Sep 13 '25 18:09 by Abdennacer Lachiheb