 

How to get array/bag of elements from Hive group by operator?


I want to group by a given field and collect the values of another field for each group into an array. Below is an example of what I am trying to achieve:

Imagine a table named 'sample_table' with two columns, as below:

F1   F2
001  111
001  222
001  123
002  222
002  333
003  555

I want to write a Hive query that will give the output below:

001  [111, 222, 123]
002  [222, 333]
003  [555]

In Pig, this can be achieved very easily with something like this:

grouped_relation = GROUP sample_table BY F1; 

Can somebody please suggest a simple way to do this in Hive? The only approach I can think of is to write a User Defined Function (UDF), but that may be a very time-consuming option.

Anuroop asked May 08 '13 15:05





1 Answer

The built-in aggregate function collect_set (documented here) gets you almost what you want. It would actually work on your example input:

SELECT F1, collect_set(F2) FROM sample_table GROUP BY F1 

Unfortunately, it also removes duplicate elements, which I imagine isn't the behavior you want. I find it odd that collect_set exists but there is no version that keeps duplicates. Someone else apparently thought the same thing. It looks like the top and second answers there will give you the UDAF you need.
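To make the difference concrete, here is a small Python sketch (not Hive code) that models the two behaviors on the question's data. The function names and the extra duplicate row are assumptions added purely for illustration. For what it's worth, later Hive releases (0.13 and up, as far as I know) added a built-in collect_list aggregate that keeps duplicates, which corresponds to the first function below.

```python
from collections import defaultdict

# Rows of sample_table as (F1, F2) pairs. The final row is a duplicate added
# for illustration only; it does not appear in the question's data.
ROWS = [
    ("001", "111"), ("001", "222"), ("001", "123"),
    ("002", "222"), ("002", "333"),
    ("003", "555"),
    ("003", "555"),
]


def group_keep_duplicates(rows):
    """Collect every F2 value per F1 key, duplicates included."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)


def group_distinct(rows):
    """Collect only distinct F2 values per F1 key, like collect_set.

    This model keeps first-seen order for readability; Hive itself makes
    no promise about the order of elements in the resulting array.
    """
    groups = defaultdict(list)
    for key, value in rows:
        if value not in groups[key]:
            groups[key].append(value)
    return dict(groups)


if __name__ == "__main__":
    print(group_keep_duplicates(ROWS)["003"])  # both copies of 555 survive
    print(group_distinct(ROWS)["003"])         # the duplicate is dropped
```

On the question's original six rows the two functions return the same result, which is why collect_set happens to work for that example.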

Daniel Koverman answered Oct 24 '22 00:10
