Hive: getting the top N records per group in a GROUP BY query


I have the following table in Hive:

user-id, user-name, user-address, clicks, impressions, page-id, page-name

I need to find the top 5 users [user-id, user-name, user-address] by clicks for each page [page-id, page-name].

I understand that we first need to group by [page-id, page-name], order the rows within each group by [clicks, impressions] descending, and then emit only the top 5 users [user-id, user-name, user-address] for each page, but I am finding it difficult to construct the query.

How can we do this using a Hive UDF?

638

asked Feb 22 '12 07:02

TopCoder

2 Answers

As of Hive 0.11, you can do this with Hive's built-in rank() function, using the simpler semantics of Hive's built-in analytics and windowing functions. Sadly, I couldn't find as many examples of these as I would have liked, but they are really, really useful. With them, both rank() and WhereWithRankCond are built in, so you can just do:

SELECT page_id, user_id, clicks
FROM (
    SELECT page_id, user_id,
           rank() OVER (PARTITION BY page_id ORDER BY clicks DESC) AS rank,
           clicks
    FROM mytable
) ranked_mytable
WHERE ranked_mytable.rank <= 5
ORDER BY page_id, rank

Column names are written with underscores because Hive would parse an unquoted page-id as page minus id. The built-in rank() starts at 1, so rank <= 5 keeps the top five rows per page.

No UDF required, and only one subquery! Also, all of the rank logic is localized.
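One thing worth checking with a windowed top-N query is how ties behave: rank() gives tied rows the same rank, so a rank filter can return more than N rows per page, while row_number() (also one of Hive's windowing functions) always caps the result at N. The sketch below is not Hive; it runs the same query shape against SQLite through Python's sqlite3 module (this needs SQLite 3.25 or newer for window functions). The sample rows, the helper names build_sample_db and top_n, and the choice of N = 2 are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical sample rows: (page_id, user_id, clicks).
SAMPLE_ROWS = [
    (1, "u1", 90),
    (1, "u2", 80),
    (1, "u3", 80),  # ties with u2 on clicks
    (1, "u4", 70),
    (2, "u5", 10),
    (2, "u6", 5),
]


def build_sample_db():
    """Create an in-memory table shaped like a cut-down mytable."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable (page_id INTEGER, user_id TEXT, clicks INTEGER)"
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", SAMPLE_ROWS)
    return conn


def top_n(conn, rank_fn, order_by, n):
    """Return the top-n rows per page_id using the given window function."""
    # rank_fn and order_by are trusted constants here, never user input.
    query = f"""
        SELECT page_id, user_id, clicks
        FROM (
            SELECT page_id, user_id, clicks,
                   {rank_fn}() OVER (PARTITION BY page_id
                                     ORDER BY {order_by}) AS rnk
            FROM mytable
        ) ranked
        WHERE rnk <= ?
        ORDER BY page_id, rnk, user_id
    """
    return conn.execute(query, (n,)).fetchall()


if __name__ == "__main__":
    conn = build_sample_db()
    # rank(): tied rows share a rank, so page 1 yields three rows for n = 2.
    print(top_n(conn, "rank", "clicks DESC", 2))
    # row_number() with a tie-breaker column: exactly n rows per page.
    print(top_n(conn, "row_number", "clicks DESC, user_id", 2))
```

If exactly N rows per page are required, swapping rank() for row_number() and adding a tie-breaker to the ORDER BY (for example impressions DESC, as the question suggests) is the usual adjustment; the rest of the query stays the same.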

You can find some more (though not enough for my liking) examples of these functions in this Jira and on this guy's blog.

163

answered Oct 23 '22 17:10

Eli

Revised answer, fixing the bug mentioned by @Himanshu Gahlot:

SELECT page_id, user_id, clicks
FROM (
    SELECT page_id, user_id, rank(page_id) AS rank, clicks
    FROM (
        SELECT page_id, user_id, clicks
        FROM mytable
        DISTRIBUTE BY page_id
        SORT BY page_id, clicks DESC
    ) a
) b
WHERE rank < 5
ORDER BY page_id, rank

Note that rank() here is a user-defined function applied to the page_id column. Each time a new page_id value appears, the rank counter is reset; otherwise it is incremented. The counter therefore restarts for each page_id partition, which is why the inner query must distribute and sort by page_id first.
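The source of the rank UDF is not shown in this answer, so the following is only a sketch of how such a counter could behave, written in Python rather than as a Hive UDF. It assumes the counter starts at 0 for each new key (under that assumption a rank < 5 filter keeps five rows per page); the names RankCounter and top_n_per_page are hypothetical.

```python
_UNSET = object()  # sentinel meaning "no key seen yet"


class RankCounter:
    """Stateful counter that restarts whenever the key changes.

    Assumption: numbering starts at 0 for the first row of each key.
    """

    def __init__(self):
        self._last_key = _UNSET
        self._counter = 0

    def __call__(self, key):
        if key != self._last_key:
            # New key: restart the counter.
            self._last_key = key
            self._counter = 0
        value = self._counter
        self._counter += 1
        return value


def top_n_per_page(rows, n):
    """rows are (page_id, user_id, clicks) tuples; keep n rows per page."""
    # Mirrors DISTRIBUTE BY page_id / SORT BY page_id, clicks DESC:
    # rows for one page must be contiguous and ordered by clicks descending.
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    rank = RankCounter()
    ranked = [(page, user, clicks, rank(page)) for page, user, clicks in ordered]
    return [row for row in ranked if row[3] < n]


if __name__ == "__main__":
    sample = [("p1", "a", 5), ("p2", "x", 9), ("p1", "b", 7),
              ("p1", "c", 1), ("p2", "y", 3)]
    for row in top_n_per_page(sample, 2):
        print(row)
```

The counter only sees the previous key, so it gives meaningful ranks only when all rows for a page arrive together and already sorted. Fed interleaved keys, it would restart on every row.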

answered Oct 23 '22 18:10

Hai-Anh Trinh
