So I have a dataframe of usernames, the threads they have posted in, and the timestamps of those posts. What I am trying to do is figure out who was the first user to post in each thread and at what time. I know that to find the first post I can group by thread and take the min of the timestamp, but that drops the username. How do I use the group by and keep the usernames?
The short answer is yes, the grouped (hourly) counts will keep the same order. To generalise: it's important that you sort before you group.
Suppose you have a df that includes columns "name" and "age", and you want to group by those two columns. To get the other columns back after a groupBy, you can join the aggregated result with the original DataFrame; the joined result will then have all the columns, including the count values.
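Here is a minimal sketch of that join-back idea applied to the question's data, assuming df is the posts DataFrame with columns thread, user and ts (the variable names are just for illustration):

from pyspark.sql import functions as F

# Earliest timestamp per thread; note this drops the user column
firsts = df.groupBy("thread").agg(F.min("ts").alias("ts"))

# Join back to the original rows on thread + ts to recover the user
# (ties on the minimum timestamp would give more than one row per thread)
result = firsts.join(df, on=["thread", "ts"], how="inner")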
PySpark groupBy on multiple columns can be done either by passing a list of the column names you want to group on, or by passing the column names as separate arguments to the groupBy() method, as in the sketch below.
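A quick sketch of both forms, again assuming a df with thread and user columns:

from pyspark.sql import functions as F

# Passing the names as separate arguments...
df.groupBy("thread", "user").agg(F.count("*").alias("posts"))

# ...or as a single list; both group on the same pair of columns
df.groupBy(["thread", "user"]).agg(F.count("*").alias("posts"))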
The Spark function collect_list() is used to aggregate values into an ArrayType column, typically after a group by or over a window partition. For example:
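A one-line sketch, gathering every user who posted in a thread into a single array column (same assumed df as above):

from pyspark.sql import functions as F

df.groupBy("thread").agg(F.collect_list("user").alias("users"))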
You can do this with one groupBy by using a HiveContext and the Hive named_struct function. The trick is that min will work on a struct by evaluating its fields in order from left to right, only moving on to the next field when the current ones are equal. So in this case it is really just comparing the timestamp column, but by making a struct that also includes the user name you will have access to it after the min function spits out the result.
# sc and sqlContext are assumed to be an existing SparkContext and HiveContext
data = [
    ('user', 'thread', 'ts'),
    ('ryan', 1, 1234),
    ('bob', 1, 2345),
    ('bob', 2, 1234),
    ('john', 2, 2223),
]
header = list(data[0])          # column names: user, thread, ts
rdd = sc.parallelize(data[1:])  # rows only, header removed
df = sqlContext.createDataFrame(rdd, header)
df.registerTempTable('table')

sql = """
SELECT thread, min(named_struct('ts', ts, 'user', user)) AS earliest
FROM table
GROUP BY thread
"""
grouped = sqlContext.sql(sql)
# Unpack the struct: the user and ts fields of the row with the minimum timestamp
final = grouped.selectExpr('thread', 'earliest.user AS user', 'earliest.ts AS timestamp')
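On more recent Spark versions the same struct trick can be written with the DataFrame API instead of SQL; a sketch, assuming the same df as above:

from pyspark.sql import functions as F

# min() on a struct compares fields left to right, so ts decides the winner
# and user simply comes along with it
earliest = df.groupBy("thread").agg(F.min(F.struct("ts", "user")).alias("earliest"))
final = earliest.select("thread", "earliest.user", "earliest.ts")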
This can also be done with the row_number() window function, which keeps all the other columns intact. Use withColumn to create a new column, something like "thread_user_order", whose value is row_number() over a window partitioned by thread and ordered by ts, then filter on "thread_user_order" == 1.
Here is roughly what that looks like in PySpark:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number
df.withColumn("thread_user_order", row_number().over(Window.partitionBy(col("thread")).orderBy(col("ts")))).where(col("thread_user_order") == 1)