 

Hive getting top n records in group by query


I have the following table in Hive:

user-id, user-name, user-address, clicks, impressions, page-id, page-name

I need to find the top 5 users [user-id, user-name, user-address] by clicks for each page [page-id, page-name].

I understand that we first need to group by [page-id, page-name], then order each group by [clicks, impressions] descending, and finally emit only the top 5 users [user-id, user-name, user-address] for each page, but I am finding it difficult to construct the query.

How can we do this using a Hive UDF?

TopCoder asked Feb 22 '12 07:02





2 Answers

As of Hive 0.11, you can do this with Hive's built-in windowing and analytics functions, which include rank(), and the semantics are simpler. Sadly, I couldn't find as many examples of these as I would have liked, but they are really useful. With them, both rank() and WhereWithRankCond are built in. The built-in rank() starts at 1, so filtering on rank <= 5 keeps the top five rows per page. Column names are written with underscores below, because Hive parses an unquoted hyphen as subtraction. You can just do:

    SELECT page_id, user_id, clicks
    FROM (
        SELECT page_id, user_id, clicks,
               rank() OVER (PARTITION BY page_id ORDER BY clicks DESC) AS rank
        FROM mytable
    ) ranked_mytable
    WHERE ranked_mytable.rank <= 5
    ORDER BY page_id, rank

No UDF required, and only one subquery! Also, all of the rank logic is localized.
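The question also asks for the user and page details and for ties on clicks to be broken by impressions. A minimal sketch of that variant follows. It assumes the table is called mytable and that the columns are named with underscores; the alias rn is only illustrative. It uses row_number() instead of rank(): rank() gives tied rows the same rank, so it can return more than five rows for a page, while row_number() never returns more than five.

```sql
-- Sketch only: table name and underscore column names are assumptions.
-- row_number() numbers rows 1, 2, 3, ... within each page, with no ties.
SELECT page_id, page_name, user_id, user_name, user_address,
       clicks, impressions, rn
FROM (
    SELECT page_id, page_name, user_id, user_name, user_address,
           clicks, impressions,
           row_number() OVER (
               PARTITION BY page_id, page_name
               ORDER BY clicks DESC, impressions DESC
           ) AS rn
    FROM mytable
) ranked
WHERE rn <= 5            -- keep at most five users per page
ORDER BY page_id, rn;
```

If you would rather keep every user tied for fifth place, swap row_number() back to rank().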

You can find some more (though not enough for my liking) examples of these functions in this Jira and on this guy's blog.

Eli answered Oct 23 '22 17:10



Revised answer, fixing the bug as mentioned by @Himanshu Gahlot

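    SELECT page_id, user_id, clicks
    FROM (
        SELECT page_id, user_id, rank(page_id) AS rank, clicks
        FROM (
            SELECT page_id, user_id, clicks
            FROM mytable
            DISTRIBUTE BY page_id
            SORT BY page_id, clicks DESC
        ) a
    ) b
    WHERE rank < 5
    ORDER BY page_id, rank

Note that rank() here is a user-defined function that takes the page_id column as its argument. It is not the built-in windowing function. Each new page_id value resets the rank counter and each repeated value increments it, so the counter restarts for every page_id partition. If your counter starts at 0, rank < 5 keeps five rows per page; if it starts at 1, use rank <= 5 instead. Column names are written with underscores because Hive parses an unquoted hyphen as subtraction.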

Hai-Anh Trinh answered Oct 23 '22 18:10
