Retrieve top n in each group of a DataFrame in pyspark

I have a DataFrame in PySpark with the following data:

    user_id  object_id  score
    user_1   object_1   3
    user_1   object_1   1
    user_1   object_2   2
    user_2   object_1   5
    user_2   object_2   2
    user_2   object_2   6

For each group of rows sharing the same user_id, I want to return the 2 records with the highest score. The result should look like this:

    user_id  object_id  score
    user_1   object_1   3
    user_1   object_2   2
    user_2   object_2   6
    user_2   object_1   5

I'm really new to PySpark. Could anyone give me a code snippet or point me to the relevant documentation for this problem? Many thanks!

557

asked Jul 15 '16 13:07

KAs

1 Answer

I believe you need to use window functions to obtain the rank of each row within its user_id group, ordered by score, and then filter the results to keep only the top two ranks.

    from pyspark.sql.window import Window
    from pyspark.sql.functions import rank, col

    # Rank rows within each user_id group, highest score first
    window = Window.partitionBy(df['user_id']).orderBy(df['score'].desc())

    # The outer parentheses let the method chain span several lines
    (df.select('*', rank().over(window).alias('rank'))
       .filter(col('rank') <= 2)
       .show())

    #+-------+---------+-----+----+
    #|user_id|object_id|score|rank|
    #+-------+---------+-----+----+
    #| user_1| object_1|    3|   1|
    #| user_1| object_2|    2|   2|
    #| user_2| object_2|    6|   1|
    #| user_2| object_1|    5|   2|
    #+-------+---------+-----+----+
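To make the semantics of the ranking step concrete, here is a minimal plain-Python sketch (no Spark required) of what partitioning by user_id, ordering by score descending, and keeping rank <= 2 amounts to. The function name `top_n_by_rank` and the tuple layout are hypothetical, chosen only for illustration. This is not how Spark executes a window, just a small model of the result.

```python
from collections import defaultdict

# Rows from the question: (user_id, object_id, score)
rows = [
    ("user_1", "object_1", 3),
    ("user_1", "object_1", 1),
    ("user_1", "object_2", 2),
    ("user_2", "object_1", 5),
    ("user_2", "object_2", 2),
    ("user_2", "object_2", 6),
]


def top_n_by_rank(rows, n):
    """Keep rows whose rank within their user_id group is <= n.

    Hypothetical helper that mimics rank() semantics: tied scores
    share a rank, and the next rank skips ahead (1, 1, 3, ...).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)  # "partition by" user_id

    kept = []
    for user_id in sorted(groups):
        # "order by" score, highest first
        ordered = sorted(groups[user_id], key=lambda r: r[2], reverse=True)
        rank = 0
        prev_score = None
        for position, row in enumerate(ordered, start=1):
            if row[2] != prev_score:
                # A new score value takes its position as its rank
                rank = position
                prev_score = row[2]
            if rank <= n:
                kept.append(row + (rank,))  # append the rank column
    return kept


result = top_n_by_rank(rows, 2)
for row in result:
    print(row)
```

One consequence worth knowing: because tied scores share a rank, filtering on rank <= 2 can return more than two rows per group when scores tie. If exactly n rows per group are required, `row_number()` from `pyspark.sql.functions` can be used in place of `rank()`. It numbers rows consecutively within each partition, so which of the tied rows is kept is arbitrary unless the ordering includes a tie-breaking column.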

In general, the official programming guide is a good place to start learning Spark.

Data

    rdd = sc.parallelize([("user_1", "object_1", 3),
                          ("user_1", "object_2", 2),
                          ("user_2", "object_1", 5),
                          ("user_2", "object_2", 2),
                          ("user_2", "object_2", 6)])
    df = sqlContext.createDataFrame(rdd, ["user_id", "object_id", "score"])

121

answered Sep 22 '22 05:09

mtoto

Tags: python, dataframe, apache-spark, apache-spark-sql, pyspark
