Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PySpark - Add a new column with a Rank by User

I have this PySpark DataFrame

df = pd.DataFrame(np.array([
    ["[email protected]",2,3], ["[email protected]",5,5],
    ["[email protected]",8,2], ["[email protected]",9,3]
]), columns=['user','movie','rating'])

sparkdf = sqlContext.createDataFrame(df, samplingRatio=0.1)
         user movie rating
[email protected]     2      3
[email protected]     5      5
[email protected]     8      2
[email protected]     9      3

I need to add a new column with a Rank by User

I want have this output

         user  movie rating  Rank
[email protected]     2      3     1
[email protected]     5      5     1
[email protected]     8      2     2
[email protected]     9      3     3

How can I do that?

like image 431
Kardu Avatar asked Apr 13 '16 17:04

Kardu


1 Answers

There is really no elegant solution here as for now. If you have to you can try something like this:

lookup = (sparkdf.select("user")
    .distinct()
    .orderBy("user")
    .rdd
    .zipWithIndex()
    .map(lambda x: x[0] + (x[1], ))
    .toDF(["user", "rank"]))

sparkdf.join(lookup, ["user"]).withColumn("rank", col("rank") + 1)

Window functions alternative is much more concise:

from pyspark.sql.functions import dense_rank

sparkdf.withColumn("rank", dense_rank().over(w))

but it is extremely inefficient and should be avoided in practice.

like image 83
zero323 Avatar answered Oct 09 '22 06:10

zero323