I have sample dataframe as below :
i/p
accountNumber assetValue
A100 1000
A100 500
B100 600
B100 200
o/p
AccountNumber assetValue Rank
A100 1000 1
A100 500 2
B100 600 1
B100 200 2
Now my question is how do we add this rank column on dataframe which is sorted by account number. I am not expecting huge volume of rows so open to idea if I need to do it outside of dataframe.
I am using Spark version 1.5 and SQLContext hence cannot use Windows function
To rank the rows of Pandas DataFrame we can use the DataFrame. rank() method which returns a rank of every respective index of a series passed. The rank is returned on the basis of position after sorting.
For assigning the rank to our dataframe's elements, we use a built-in function of the pandas library that is the . rank() function. We pass the criteria based on which we are ranking the elements to it, and this function returns a new column in each row where the ranking is stored.
The ranking of a variable in an R data frame can be done by using rank function. For example, if we have a data frame df that contains column x then rank of values in x can be found as rank(df$x).
sort_values(['date']) for sorting the dates for each user.
You can use row_number
function and Window
expression with which you can specify the partition
and order
columns:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val df = Seq(("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200)).toDF("accountNumber", "assetValue")
df.withColumn("rank", row_number().over(Window.partitionBy($"accountNumber").orderBy($"assetValue".desc))).show
+-------------+----------+----+
|accountNumber|assetValue|rank|
+-------------+----------+----+
| A100| 1000| 1|
| A100| 500| 2|
| B100| 600| 1|
| B100| 200| 2|
+-------------+----------+----+
Raw SQL:
val df = sc.parallelize(Seq(
("A100", 1000), ("A100", 500), ("B100", 600), ("B100", 200)
)).toDF("accountNumber", "assetValue")
df.registerTempTable("df")
sqlContext.sql("SELECT accountNumber,assetValue, RANK() OVER (partition by accountNumber ORDER BY assetValue desc) AS rank FROM df").show
+-------------+----------+----+
|accountNumber|assetValue|rank|
+-------------+----------+----+
| A100| 1000| 1|
| A100| 500| 2|
| B100| 600| 1|
| B100| 200| 2|
+-------------+----------+----+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With