Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Aggregate strings in group by and ordered in Hive and Presto

I have a table in the following format:

IDX IDY Time Text
idx1 idy1 t1 text1
idx1 idy2 t2 text2
idx1 idy2 t3 text3
idx1 idy1 t4 text4
idx2 idy3 t5 text5
idx2 idy3 t6 text6
idx2 idy1 t7 text7
idx2 idy3 t8 text8

What I'd like to see is something like this:

idx1 text1
idx1 text2, text3
idx1 text4
idx2 text5, text6
idx2 text7
idx2 text8

So in the final phase, I can get to:

text1
text2, text3
text4
==SEPERATOR==
text5, text6
text7
text8

How can I perform this in Hive or Presto? Thanks.

like image 570
Nick Avatar asked Sep 18 '25 14:09

Nick


1 Answers

Hive

This is the base query, you can take it from here if you like

select  IDX
       ,IDY
       ,min(time)                           as from_time
       ,max(time)                           as to_time
       ,concat_ws(',',collect_list (Text))  as text

from   (select  *
               ,row_number () over 
                (
                    partition by    IDX
                    order by        Time
                )   as rn
               ,row_number () over 
                (
                    partition by    IDX,IDY
                    order by        Time
                )   as rn_IDY

        from    mytable
        ) t

group by    IDX,IDY
           ,rn - rn_IDY

order by    IDX,from_time

+------+------+-----------+---------+-------------+
| idx  | idy  | from_time | to_time |    text     |
+------+------+-----------+---------+-------------+
| idx1 | idy1 | t1        | t1      | text1       |
| idx1 | idy2 | t2        | t3      | text2,text3 |
| idx1 | idy1 | t4        | t4      | text4       |
| idx2 | idy3 | t5        | t6      | text5,text6 |
| idx2 | idy1 | t7        | t7      | text7       |
| idx2 | idy3 | t8        | t8      | text8       |
+------+------+-----------+---------+-------------+

Presto

select  array_join(array_agg (Text),',')   as text

from   (select  *
               ,row_number () over 
                (
                    partition by    IDX
                    order by        Time
                )   as rn
               ,row_number () over 
                (
                    partition by    IDX,IDY
                    order by        Time
                )   as rn_IDY

        from    mytable
        ) t

group by    IDX,IDY
           ,rn - rn_IDY

order by    IDX,min(time)
;

+-------------+
|    text     |
+-------------+
| text1       |
| text2,text3 |
| text4       |
| text5,text6 |
| text7       |
| text8       |
+-------------+
like image 113
David דודו Markovitz Avatar answered Sep 21 '25 05:09

David דודו Markovitz