Group by, rank and aggregate a Spark DataFrame using PySpark

I have a dataframe that looks like:

A     B    C
---------------
A1    B1   0.8
A1    B2   0.55
A1    B3   0.43

A2    B1   0.7
A2    B2   0.5
A2    B3   0.5

A3    B1   0.2
A3    B2   0.3
A3    B3   0.4

How do I convert column 'C' into a relative rank (higher score -> better rank) within each value of column 'A'? Expected output:

A     B    Rank
---------------
A1    B1   1
A1    B2   2
A1    B3   3

A2    B1   1
A2    B2   2
A2    B3   2

A3    B1   3
A3    B2   2
A3    B3   1

The final state I want to reach is to group by column 'B' and collect the ranks for each 'A':

Example:

B    Ranks
B1   [1,1,3]
B2   [2,2,2]
B3   [3,2,1]
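
For reference, a minimal sketch that builds this sample data, assuming a SparkSession is available as 'spark' (the variable names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The (A, B, C) rows from the question.
df = spark.createDataFrame(
    [("A1", "B1", 0.8), ("A1", "B2", 0.55), ("A1", "B3", 0.43),
     ("A2", "B1", 0.7), ("A2", "B2", 0.5), ("A2", "B3", 0.5),
     ("A3", "B1", 0.2), ("A3", "B2", 0.3), ("A3", "B3", 0.4)],
    ["A", "B", "C"])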
asked Jan 15 '17 by futurenext110


People also ask

How do you use groupBy and AGG in PySpark?

Method 1: using the groupBy() method. In PySpark, groupBy() collects identical values into groups on the DataFrame so that aggregate functions can be run on each group. Here the aggregate function is sum(), which returns the total of the values in each group.
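
A minimal sketch of that pattern, reusing the question's schema (columns 'B' and 'C'; 'df' is the question's DataFrame and 'total_C' is an illustrative alias):

from pyspark.sql import functions as F

# Total of C for each distinct value of B.
totals = df.groupBy("B").agg(F.sum("C").alias("total_C"))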

How does PySpark sort grouped data?

Sort the grouped result with the sort() function, accessing the column with col() and applying desc() to sort it in descending order.
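
A sketch of that, assuming the 'totals' DataFrame from the previous snippet:

from pyspark.sql.functions import col

# Highest total first.
totals.sort(col("total_C").desc()).show()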

Can we use groupBy without aggregate function in PySpark?

Not in the pandas sense: at best you can use first() or last() to pull representative values out of a groupBy, but you cannot get everything the way you can in pandas. Since there is a basic difference between how pandas and Spark handle data, not all functionality can be used in the same way.
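
A sketch of that workaround, taking a representative value per group rather than a true aggregate (the 'first_C' alias is illustrative):

from pyspark.sql import functions as F

# first() keeps one value per group instead of combining them.
firsts = df.groupBy("B").agg(F.first("C").alias("first_C"))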


2 Answers

Add rank:

from pyspark.sql.functions import dense_rank, desc, collect_list, struct, sort_array
from pyspark.sql.window import Window

# Rank within each A partition, highest C first; dense_rank() lets ties share a rank.
ranked = df.withColumn(
    "rank",
    dense_rank().over(Window.partitionBy("A").orderBy(desc("C"))))

Group by:

# Collect an (A, rank) struct for every row of each B.
grouped = ranked.groupBy("B").agg(
    collect_list(struct("A", "rank")).alias("tmp"))

Sort and select:

# sort_array() orders the structs by their first field, A, so each B's
# ranks come out in A1, A2, A3 order.
grouped.select("B", sort_array("tmp")["rank"].alias("ranks"))
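
Chained on the question's sample data, the three steps should yield the requested aggregation (a sketch; row order in show() is not guaranteed):

grouped.select("B", sort_array("tmp")["rank"].alias("ranks")).show()

# +---+---------+
# |  B|    ranks|
# +---+---------+
# | B1|[1, 1, 3]|
# | B2|[2, 2, 2]|
# | B3|[3, 2, 1]|
# +---+---------+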

Tested with Spark 2.1.0.

answered Sep 21 '22 by user7337271


from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

# row_number() gives a strict 1..n ordering within each partition;
# unlike dense_rank(), tied values do not share a rank.
windowSpec = Window.partitionBy("col1").orderBy("col2")
ranked = demand.withColumn("col_rank", row_number().over(windowSpec))
ranked.show(1000)
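
Note that because row_number() assigns a strict 1..n sequence, applying it to the question's data (partition by "A", order by desc("C")) would give A2's tied 0.5 scores ranks 2 and 3, whereas dense_rank(), as in the first answer, gives both rank 2 and matches the expected output.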
answered Sep 20 '22 by Laxman Jeergal