Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Computing running sum using hive udf functions

Tags:

sql

hadoop

hive

I am a newbie in Hive and would like to pardon my ignorance beforehand for any stuffs below. I have a table as follows:

SELECT a.storeid, a.smonth, a.sales FROM table a;
1001    1       35000.0
1002    2       35000.0
1001    2       25000.0
1002    3       110000.0
1001    3       40000.0
1002    1       40000.0

My objective output is as follows:

1001    1       35000.0 35000.0
1001    2       25000.0 60000.0
1001    3       40000.0 100000.0
1002    1       40000.0 40000.0
1002    2       35000.0 75000.0
1002    3       110000.0 185000.0

I have written a simple hive udf sum class to achieve the above and used SORT BY storeid, smonth in the query:

SELECT a.storeid, a.smonth, a.sales, rsum(sales)
FROM (SELECT * FROM table SORT BY storeid, smonth) a;

Obviously, it does not produce the above output since there is only one mapper and the same udf instance is invoked which generates a running sum on the total set. My objective is to reset the runningSum instance variable in the udf class for each storeid so that the evaluate function returns the above output. I have used the following: 1. Pass the storeid variable rsum(sales, storeid) and then we can handle the situation correctly in the udf class. 2. Using 2 mappers as in the following query:

set mapred.reduce.tasks=2;
SELECT a.storeid, a.smonth, a.sales, rsum(sales)
FROM (SELECT * FROM table DISTRIBUTE BY storeid SORT BY storeid, smonth) a;

1002    1       40000.0 40000.0
1002    2       35000.0 75000.0
1002    3       110000.0 185000.0
1001    1       35000.0 35000.0
1001    2       25000.0 60000.0
1001    3       40000.0 100000.0

Why the 1002 appears at the top always? I would like to seek your suggestions on the other different methods in which I can achieve the same (such as sub-queries/joins) apart from the above methods. Also, what will be the time complexities of your suggested methods?

like image 406
Code Warrior Avatar asked Jan 28 '13 22:01

Code Warrior


2 Answers

Hive provides a better way to do this in a single row -
Please follow below process to achieve your target output

create a hive table which can contain your data set -

1001    1       35000.0
1002    2       35000.0
1001    2       25000.0
1002    3       110000.0
1001    3       40000.0
1002    1       40000.0

Now simply run below command in your hive terminal -

SELECT storeid, smonth, sales, SUM(sales) OVER (PARTITION BY storeid ORDER BY smonth) FROM table_name;

The output will be like -

1001  1  35000.0  35000.0
1001  2  25000.0  60000.0
1001  3  40000.0  100000.0
1002  1  40000.0  40000.0
1002  2  35000.0  75000.0
1002  3  110000.0 185000.0

I hope this may help you to get your target output.

like image 160
Bector Avatar answered Oct 04 '22 20:10

Bector


Alternatively, you might have a look at this Hive ticket which contains several feature extensions.
Among others there's a cumulative sum implementation (GenericUDFSum).

This function (referred to as "rsum") takes two arguments, the hash of the id (by which the records are partitioned among the reducers) and their corresponding values to be summed up:

select t.storeid, t.smonth, t.sales, rsum(hash(t.storeid),t.sales) as sales_sum 
  from (select storeid, smonth, sales from sm distribute by hash(storeid) 
    sort by storeid, smonth) t;

1001  1  35000.0  35000.0
1001  2  25000.0  60000.0
1001  3  40000.0  100000.0
1002  1  40000.0  40000.0
1002  2  35000.0  75000.0
1002  3  110000.0 185000.0
like image 30
Lorand Bendig Avatar answered Oct 04 '22 21:10

Lorand Bendig