Creating a row number for each row in a PySpark DataFrame using the row_number() function with Spark 2.2

I have a PySpark DataFrame -

valuesCol = [('Sweden',31),('Norway',62),('Iceland',13),('Finland',24),('Denmark',52)]
df = sqlContext.createDataFrame(valuesCol,['name','id'])
+-------+---+
|   name| id|
+-------+---+
| Sweden| 31|
| Norway| 62|
|Iceland| 13|
|Finland| 24|
|Denmark| 52|
+-------+---+

I wish to add a column to this DataFrame containing the row number (serial number) of each row. My final output should be:

+-------+---+--------+
|   name| id|row_num |
+-------+---+--------+
| Sweden| 31|       1|
| Norway| 62|       2|
|Iceland| 13|       3|
|Finland| 24|       4|
|Denmark| 52|       5|
+-------+---+--------+

My Spark version is 2.2.

I tried this code, but it doesn't work -

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window
w = Window().orderBy()
df = df.withColumn("row_num", row_number().over(w))
df.show()

I get this error:

AnalysisException: 'Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table;'

If I understand it correctly, I need to order by some column, but I don't want something like w = Window().orderBy('id'), because that would reorder the entire DataFrame.

Can anyone suggest how to achieve the above output using the row_number() function?

asked Oct 29 '18 by cph_sto

2 Answers

You need to specify a column for the order clause. If you don't actually care about the ordering, you can order by a dummy literal value instead. Try the below:

from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

# Order by a constant literal: satisfies the ORDER BY requirement
# without imposing any real ordering on the data.
w = Window().orderBy(lit('A'))
df = df.withColumn("row_num", row_number().over(w))
answered Oct 16 '22 by Ali Yesilli


I had a similar problem, but in my case @Ali Yesilli's solution failed, because I was reading multiple input files separately and ultimately unioning them into a single dataframe. In that case, the order within a window ordered by a dummy value proved to be unpredictable.

So to achieve more robust ordering, I used monotonically_increasing_id:

from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# Capture the original row order as an increasing id, number the rows
# by it, then drop the helper column.
df = df.withColumn('original_order', monotonically_increasing_id())
df = df.withColumn('row_num', row_number().over(Window.orderBy('original_order')))
df = df.drop('original_order')
answered Oct 16 '22 by Waiski