Sorting within collect_list() in hive

Tags:

hive

hiveql

Let's say I have a hive table that looks like this:

ID    event    order_num
------------------------
A      red         2
A      blue        1
A      yellow      3
B      yellow      2
B      green       1
...

I'm trying to use collect_list to generate a list of events for each ID. So something like the following:

SELECT ID,
       collect_list(event) AS events_list
FROM table
GROUP BY ID;

However, within each of the IDs that I group by, I need to sort by order_num. So that my resulting table would look like this:

ID    events_list
------------------------
A      ["blue","red","yellow"]
B      ["green","yellow"]

I can't do a global sort by ID and order_num before the collect_list() query because the table is massive. Is there a way to sort by order_num within collect_list?

Thanks!

Slyron asked Jun 08 '18 18:06


2 Answers

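So, I found the answer here. The trick is to use a subquery with DISTRIBUTE BY and SORT BY clauses: DISTRIBUTE BY sends all rows for a given ID to the same reducer, and SORT BY orders the rows within each reducer, so collect_list() receives each ID's events already sorted by order_num. See below: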

WITH table1 AS (
    SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
    SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
    SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
    SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
    SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
)

-- Collect the events per ID, in order_num order
SELECT
    subquery.ID,
    collect_list(subquery.event) AS events_list
FROM (
    SELECT
        table1.ID,
        table1.event,
        table1.order_num
    FROM table1
    DISTRIBUTE BY
        table1.ID
    SORT BY
        table1.ID,
        table1.order_num
) subquery
GROUP BY subquery.ID;
Slyron answered Sep 23 '22 11:09


The function sort_array() should sort the collect_list() items. Note that it sorts in ascending order of the event values themselves:

SELECT ID,
       sort_array(collect_list(event)) AS events_list
FROM table
GROUP BY ID;
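
To sort by order_num rather than by the event strings, one possible variation (a sketch, not part of the original answer) is to collect structs whose first field is order_num, sort those, and then pull out the event field. This assumes a Hive version whose sort_array() accepts arrays of structs and compares struct fields in declaration order; the aliases s and sorted_pairs are made up for illustration, and table1 is the sample CTE from the answer above.

```sql
-- Sketch only: assumes sort_array() accepts array<struct<...>> in your Hive version
-- (older releases may accept only primitive element types).
WITH table1 AS (
    SELECT 'A' AS ID, 'red' AS event, 2 AS order_num UNION ALL
    SELECT 'A' AS ID, 'blue' AS event, 1 AS order_num UNION ALL
    SELECT 'A' AS ID, 'yellow' AS event, 3 AS order_num UNION ALL
    SELECT 'B' AS ID, 'yellow' AS event, 2 AS order_num UNION ALL
    SELECT 'B' AS ID, 'green' AS event, 1 AS order_num
)
SELECT
    s.ID,
    -- Accessing a field on an array of structs yields an array of that field.
    s.sorted_pairs.event AS events_list
FROM (
    SELECT
        ID,
        -- order_num is the first struct field, so it drives the sort order.
        sort_array(
            collect_list(named_struct('order_num', order_num, 'event', event))
        ) AS sorted_pairs
    FROM table1
    GROUP BY ID
) s;
```

Unlike the DISTRIBUTE BY / SORT BY subquery, this does not depend on the order in which rows arrive at collect_list(), at the cost of sorting each collected array.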
nobody answered Sep 23 '22 11:09