
Summing values of Hive array types

Hive has a pretty nice Array type that is very useful in theory, but in practice I found very little information on how to perform any kind of operation on it. We store a series of numbers in an array-type column and need to SUM them in a query, preferably from the n-th to the m-th element. Is this possible with standard HiveQL, or does it require a UDF or a custom mapper/reducer?

Note: we're using Hive 0.8.1 in an EMR environment.

Alex N. asked Sep 12 '12 03:09



2 Answers

I'd write a simple UDF for this purpose. You need to have hive-exec on your build path.
For example, with Maven:

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>0.8.1</version>
</dependency>

A simple raw implementation would look like this:

package com.myexample;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;

public class SubArraySum extends UDF {

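    public IntWritable evaluate(ArrayList<Integer> list,
      IntWritable from, IntWritable to) {
        // -1 signals missing or empty input (note: -1 is also a possible sum)
        IntWritable result = new IntWritable(-1);
        if (list == null || list.size() < 1 || from == null || to == null) {
            return result;
        }

        // clamp the range to the array so that subList cannot throw
        int n = Math.max(0, Math.min(list.size(), to.get()));
        int m = Math.max(0, Math.min(n, from.get()));

        // m: inclusive, n: exclusive (both 0-based)
        List<Integer> subList = list.subList(m, n);

        int sum = 0;
        for (Integer i : subList) {
            if (i != null) { // array elements can be NULL
                sum += i;
            }
        }
        result.set(sum);
        return result;
    }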
}

Next, build a jar and load it in the Hive shell:

hive> add jar /home/user/jar/myjar.jar;
hive> create temporary function subarraysum as 'com.myexample.SubArraySum';

Now you can use it to calculate the sum of the array you have.

For example, assume that you have an input file with two tab-separated columns:

1   0,1,2,3,4
2   5,6,7,8,9

Load it into mytable:

hive> create external table mytable (
  id int,
  nums array<int>
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hadoopuser/hive/input';
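To make the two delimiters in the table definition concrete, here is a rough plain-Java sketch of how one input line maps to the nums column. It is an illustration only: Hive does this parsing itself, and the class and method names (RowParseSketch, parseNums) are hypothetical. It assumes a tab between fields and commas between array items, as declared above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: Hive parses the file itself; this is not Hive code.
final class RowParseSketch {

    // Returns the array column of one text line, assuming a tab between
    // fields and commas between array items.
    static List<Integer> parseNums(String line) {
        String[] fields = line.split("\t", -1);
        List<Integer> nums = new ArrayList<>();
        if (fields.length < 2 || fields[1].isEmpty()) {
            return nums; // no array column on this line
        }
        for (String item : fields[1].split(",")) {
            nums.add(Integer.parseInt(item.trim()));
        }
        return nums;
    }

    public static void main(String[] args) {
        System.out.println(parseNums("1\t0,1,2,3,4")); // prints [0, 1, 2, 3, 4]
    }
}
```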

Then run some queries:

hive> select * from mytable;
1   [0,1,2,3,4]
2   [5,6,7,8,9]

Sum the elements in the range from m to n, where m=1 (inclusive) and n=3 (exclusive), counting from 0:

hive> select subarraysum(nums, 1,3) from mytable;
3
13

Or, to total the range sums across all rows:

hive> select sum(subarraysum(nums, 1,3)) from mytable;
16
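
The results above can be sanity-checked without a cluster. The following is a minimal plain-Java sketch of the same range logic, assuming 0-based indexes with the start inclusive and the end exclusive. The class and method names (RangeSumSketch, sumRange) are hypothetical and not part of Hive, and clamping out-of-range bounds is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hive: mirrors the UDF's range logic
// so that it can be checked locally.
final class RangeSumSketch {

    // from is inclusive, to is exclusive, both 0-based (same as List.subList).
    // Out-of-range bounds are clamped to the list (assumption of this sketch).
    static int sumRange(List<Integer> nums, int from, int to) {
        int end = Math.max(0, Math.min(nums.size(), to));
        int start = Math.max(0, Math.min(end, from));
        int sum = 0;
        for (Integer value : nums.subList(start, end)) {
            if (value != null) { // skip NULL array elements
                sum += value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // The two sample rows from mytable
        System.out.println(sumRange(Arrays.asList(0, 1, 2, 3, 4), 1, 3)); // 1 + 2 = 3
        System.out.println(sumRange(Arrays.asList(5, 6, 7, 8, 9), 1, 3)); // 6 + 7 = 13
    }
}
```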
Lorand Bendig answered Sep 18 '22 16:09



The answer above is quite well explained. I am posting a very simple implementation of the UDF.

package com.ak.hive.udf.test;

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class ArraySumUDF extends UDF {
    // startIndex and endIndex are 1-based and both inclusive
    public int evaluate(ArrayList<Integer> arrayOfIntegers, int startIndex, int endIndex) {
        // add code to handle index problems (null array, out-of-range indexes)
        int sum = 0;
        for (int count = startIndex - 1; count < endIndex; count++) {
            sum += arrayOfIntegers.get(count);
        }
        return sum;
    }
}

Here are the table creation and the other statements as well.

create table table1 (col1 int,col2 array<int>)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '~' STORED AS TEXTFILE;

load data local inpath '/home/ak/Desktop/hivedata' into table table1;

My input file looks like this:

1,3~5~8~5~7~9
2,93~5~8~5~7~29
3,3~95~8~5~27~9
4,3~5~58~15~7~9
5,3~25~8~55~7~49
6,3~25~8~15~7~19
7,3~55~78~5~7~9

I have created a jar of my UDF and add it to Hive using the following command:

add jar file:///home/ak/Desktop/array.jar;

Then I create a temporary function:

create temporary function getSum as 'com.ak.hive.udf.test.ArraySumUDF';

Then run a sample query:

select col1,getSum(col2,1,3) from table1;
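
The index handling that the UDF leaves open could be approached along the lines of the sketch below. It is plain Java with hypothetical names (ArraySumBoundsSketch, sumInclusive), not the code in the jar above. It assumes 1-based indexes that are inclusive at both ends, as in getSum(col2,1,3), clamps them to the array, and treats a NULL array as empty; all of these are assumptions of the sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical variant, not part of Hive and not the jar used above.
final class ArraySumBoundsSketch {

    // startIndex and endIndex are 1-based and both inclusive.
    static int sumInclusive(List<Integer> values, int startIndex, int endIndex) {
        if (values == null) {
            return 0; // assumption: treat a NULL array as an empty one
        }
        int first = Math.max(1, startIndex);
        int last = Math.min(values.size(), endIndex);
        int sum = 0;
        for (int position = first; position <= last; position++) {
            Integer value = values.get(position - 1); // convert to 0-based
            if (value != null) {
                sum += value;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // First sample row: 3~5~8~5~7~9
        List<Integer> row = Arrays.asList(3, 5, 8, 5, 7, 9);
        System.out.println(sumInclusive(row, 1, 3));  // 3 + 5 + 8 = 16
        System.out.println(sumInclusive(row, 4, 10)); // end clamped: 5 + 7 + 9 = 21
    }
}
```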

This should cover the basic need. If this is not what you are asking for, please reply so that I can help further.

Arun A K answered Sep 16 '22 16:09
