
PySpark: how to add a row number to a dataframe without changing the order?

I want to add a row-number column to the dataframe below while keeping the original order.

The existing dataframe:

+---+
|val|
+---+
|1.0|
|0.0|
|0.0|
|1.0|
|0.0|
+---+

My expected output:

+---+---+
|idx|val|
+---+---+
|  1|1.0|
|  2|0.0|
|  3|0.0|
|  4|1.0|
|  5|0.0|
+---+---+

I have tried several approaches, like the code below:

from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

# Attempt 1: order the window by a constant literal
w = Window().orderBy(lit('A'))
df = df.withColumn("row_num", row_number().over(w))

# Attempt 2: partition and order by existing columns
w = Window.partitionBy("xxx").orderBy("yyy")
df = df.withColumn("row_num", row_number().over(w))

But the above code only groups rows by value and assigns an index within each group, which leaves my dataframe out of its original order.

Can we just add one column without changing the order?

asked Sep 11 '25 03:09 by Jason Wong

1 Answer

There's no such thing as inherent row order in Apache Spark. It is a distributed system where data is divided into smaller chunks called partitions, and each operation is applied to those partitions independently; the assignment of rows to partitions is effectively arbitrary. So you will not be able to preserve order unless you specify it in an orderBy() clause: if you need to keep an order, you need a column whose values define that order.
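As a sketch of one common workaround (not part of the original answer): if the dataframe's current partition order still reflects the order you want, RDD.zipWithIndex() assigns consecutive indices in that order. This assumes the dataframe was just created from an ordered source and has not been shuffled:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The example data from the question
df = spark.createDataFrame([(1.0,), (0.0,), (0.0,), (1.0,), (0.0,)], ["val"])

# zipWithIndex pairs each row with its index in the current partition order;
# rebuild a dataframe with a 1-based idx as the first column.
df_with_idx = (
    df.rdd.zipWithIndex()
      .map(lambda pair: (pair[1] + 1,) + tuple(pair[0]))
      .toDF(["idx"] + df.columns)
)

df_with_idx.show()

Note that anything that shuffles the data (joins, repartitioning, window functions) can reorder rows, so materialize the idx column first and use it in later orderBy() calls. monotonically_increasing_id() is a cheaper alternative when you only need increasing, not consecutive, IDs.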

answered Sep 13 '25 18:09 by Abdennacer Lachiheb