Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive: Sum over a specified group (HiveQL)

I have a table:

key    product_code    cost
1      UK              20
1      US              10
1      EU              5
2      UK              3
2      EU              6

I would like to find the sum of all products for each group of "key" and append to each row. For example for key = 1, find the sum of costs of all products (20+10+5=35) and then append result to all rows which correspond to the key = 1. So end result:

key    product_code    cost     total_costs
1      UK              20       35
1      US              10       35
1      EU              5        35
2      UK              3        9
2      EU              6        9

I would prefer to do this without using a sub-join as this would be inefficient. My best idea would be to use the over function in conjunction with the sum function but I cant get it to work. My best try:

SELECT key, product_code, sum(costs) over(PARTITION BY key)
FROM test
GROUP BY key, product_code;

Iv had a look at the docs but there so cryptic I have no idea how to work out how to do it. Im using Hive v0.12.0, HDP v2.0.6, HortonWorks Hadoop distribution.

like image 676
joshlk Avatar asked Aug 01 '14 14:08

joshlk


2 Answers

Similar to @VB_ answer, use the BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING statement.

The HiveQL query is therefore:

SELECT key, product_code,
SUM(costs) OVER (PARTITION BY key ORDER BY key ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM test;
like image 172
joshlk Avatar answered Oct 23 '22 06:10

joshlk


The analytics function sum gives cumulative sums. For example, if you did:

select key, product_code, cost, sum(cost) over (partition by key) as total_costs from test

then you would get:

key    product_code    cost     total_costs
1      UK              20       20
1      US              10       30
1      EU              5        35
2      UK              3        3
2      EU              6        9

which, it seems, is not what you want.

Instead, you should use the aggregation function sum, combined with a self join to accomplish this:

select test.key, test.product_code, test.cost, agg.total_cost
from (
  select key, sum(cost) as total_cost
  from test
  group by key
) agg
join test
on agg.key = test.key;
like image 43
Joe K Avatar answered Oct 23 '22 06:10

Joe K