
How to Sort a Dataframe in Pyspark [duplicate]

I have a dataframe:

# +---+--------+---------+
# | id|  rank  |  value  |
# +---+--------+---------+
# |  1|    A   |    10   |
# |  2|    B   |    46   |
# |  3|    D   |     8   |
# |  4|    C   |     8   |
# +---+--------+---------+

I want to sort it by value, then by rank. This seems like it should be simple, but I can't find how to do it in the documentation or on SO for PySpark, only for R and Scala.

This is how it should look after sorting; .show() should print:

# +---+--------+---------+
# | id|  rank  |  value  |
# +---+--------+---------+
# |  4|    C   |     8   |
# |  3|    D   |     8   |
# |  1|    A   |    10   |
# |  2|    B   |    46   |
# +---+--------+---------+

Tibberzz asked Jun 12 '18 19:06


2 Answers

df.orderBy(["value", "rank"], ascending=[1, 1])

Reference: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy

gasparms answered Nov 07 '22 19:11


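Assuming your DataFrame is stored in a variable called df, call df.orderBy('value').show() to sort it by value. To also break ties by rank, as the question asks, pass both columns: df.orderBy('value', 'rank').show()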

Arnon Rotem-Gal-Oz answered Nov 07 '22 18:11