Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do we rank dataframe?

I have sample dataframe as below :

i/p

accountNumber   assetValue  
A100            1000         
A100            500          
B100            600          
B100            200          

o/p

AccountNumber   assetValue  Rank
A100            1000         1
A100            500          2
B100            600          1
B100            200          2

Now my question is how do we add this rank column on dataframe which is sorted by account number. I am not expecting huge volume of rows so open to idea if I need to do it outside of dataframe.

I am using Spark version 1.5 and SQLContext hence cannot use Windows function

like image 769
user3293666 Avatar asked Mar 23 '17 03:03

user3293666


People also ask

How do you rank up in pandas?

To rank the rows of Pandas DataFrame we can use the DataFrame. rank() method which returns a rank of every respective index of a series passed. The rank is returned on the basis of position after sorting.

How do you rank rows in a DataFrame in Python?

For assigning the rank to our dataframe's elements, we use a built-in function of the pandas library that is the . rank() function. We pass the criteria based on which we are ranking the elements to it, and this function returns a new column in each row where the ranking is stored.

How do you rank a DataFrame in R?

The ranking of a variable in an R data frame can be done by using rank function. For example, if we have a data frame df that contains column x then rank of values in x can be found as rank(df$x).

How do you rank dates in Python?

sort_values(['date']) for sorting the dates for each user.


2 Answers

You can use row_number function and Window expression with which you can specify the partition and order columns:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val df = Seq(("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200)).toDF("accountNumber", "assetValue")

df.withColumn("rank", row_number().over(Window.partitionBy($"accountNumber").orderBy($"assetValue".desc))).show

+-------------+----------+----+
|accountNumber|assetValue|rank|
+-------------+----------+----+
|         A100|      1000|   1|
|         A100|       500|   2|
|         B100|       600|   1|
|         B100|       200|   2|
+-------------+----------+----+
like image 110
Psidom Avatar answered Oct 25 '22 04:10

Psidom


Raw SQL:

val df = sc.parallelize(Seq(
  ("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200)
)).toDF("accountNumber", "assetValue")

df.registerTempTable("df")
sqlContext.sql("SELECT accountNumber,assetValue, RANK() OVER (partition by accountNumber ORDER BY assetValue desc) AS rank FROM df").show


+-------------+----------+----+
|accountNumber|assetValue|rank|
+-------------+----------+----+
|         A100|      1000|   1|
|         A100|       500|   2|
|         B100|       600|   1|
|         B100|       200|   2|
+-------------+----------+----+
like image 36
Nayan Sharma Avatar answered Oct 25 '22 04:10

Nayan Sharma