Hive has a pretty nice Array type that is very useful in theory, but in practice I found very little information on how to do any kind of operations with it. We store a series of numbers in an array-type column and need to SUM them in a query, preferably from the n-th to the m-th element. Is it possible with standard HiveQL, or does it require a UDF or a custom mapper/reducer?
Note: we're using Hive 0.8.1 in an EMR environment.
I'd write a simple UDF for this purpose. You need to have hive-exec in your build path.
E.g., in the case of Maven:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>0.8.1</version>
</dependency>
A simple raw implementation would look like this:
package com.myexample;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.IntWritable;

public class SubArraySum extends UDF {

    public IntWritable evaluate(ArrayList<Integer> list,
            IntWritable from, IntWritable to) {
        IntWritable result = new IntWritable(-1);
        if (list == null || list.size() < 1) {
            return result;
        }
        int m = from.get();
        int n = to.get();
        // m: inclusive, n: exclusive
        List<Integer> subList = list.subList(m, n);
        int sum = 0;
        for (Integer i : subList) {
            sum += i;
        }
        result.set(sum);
        return result;
    }
}
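One thing to watch in the raw version: `subList(m, n)` throws if the range falls outside the list, and unboxing a NULL element fails too. Below is a minimal plain-Java sketch of the same zero-based, end-exclusive summation with the bounds clamped. It deliberately has no Hive dependency so it can be run on its own; the class name `ClampedRangeSum` and the choice to return 0 for an empty or missing array are my assumptions, not anything Hive prescribes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Hive): sums list[from, to) with clamped bounds.
final class ClampedRangeSum {

    private ClampedRangeSum() {
    }

    // from is inclusive, to is exclusive, both zero-based (same convention as subList).
    static int sum(List<Integer> list, int from, int to) {
        if (list == null || list.isEmpty()) {
            return 0; // assumption: an empty or missing array sums to 0
        }
        int start = Math.max(0, from);        // clamp the lower bound
        int end = Math.min(list.size(), to);  // clamp the upper bound
        int total = 0;
        for (int i = start; i < end; i++) {
            Integer value = list.get(i);
            if (value != null) { // skip NULL elements instead of failing on unboxing
                total += value;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(Arrays.asList(0, 1, 2, 3, 4), 1, 3));  // 1 + 2
        System.out.println(sum(Arrays.asList(5, 6, 7, 8, 9), 1, 3));  // 6 + 7
        System.out.println(sum(Arrays.asList(5, 6, 7, 8, 9), 3, 10)); // 8 + 9, end clamped to size
    }
}
```

If you would rather get an error for a bad range than a silently shortened sum, replace the clamping with an explicit check. The loop body could be dropped into `evaluate` in place of the `subList` call.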
Next, build a jar and load it in Hive shell:
hive> add jar /home/user/jar/myjar.jar;
hive> create temporary function subarraysum as 'com.myexample.SubArraySum';
Now you can use it to calculate the sum of the array you have.
For example, assume you have an input file with tab-separated columns:
1 0,1,2,3,4
2 5,6,7,8,9
Load it into mytable:
hive> create external table mytable (
        id int,
        nums array<int>
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      COLLECTION ITEMS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/user/hadoopuser/hive/input';
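To make the two delimiters concrete: each line is first split into fields on the tab, and the second field is then split into array items on the comma. The plain-Java sketch below only illustrates that splitting for one input line. It is not how Hive's SerDe is implemented, and the class and method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the delimiters declared in the table definition.
final class DelimitedRowSketch {

    private DelimitedRowSketch() {
    }

    // First field (before the tab) is the id column.
    static int parseId(String line) {
        String[] fields = line.split("\t", -1); // FIELDS TERMINATED BY '\t'
        return Integer.parseInt(fields[0].trim());
    }

    // Second field (after the tab) is the nums column, items separated by commas.
    static List<Integer> parseNums(String line) {
        String[] fields = line.split("\t", -1);
        List<Integer> nums = new ArrayList<>();
        for (String item : fields[1].split(",")) { // COLLECTION ITEMS TERMINATED BY ','
            nums.add(Integer.parseInt(item.trim()));
        }
        return nums;
    }

    public static void main(String[] args) {
        String line = "1\t0,1,2,3,4";
        System.out.println(parseId(line) + " " + parseNums(line));
    }
}
```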
Execute some queries then:
hive> select * from mytable;
1 [0,1,2,3,4]
2 [5,6,7,8,9]
Sum the range from m to n, where m=1 (inclusive) and n=3 (exclusive), both zero-based:
hive> select subarraysum(nums, 1,3) from mytable;
3
13
Or, to total the per-row sums across all rows:
hive> select sum(subarraysum(nums, 1,3)) from mytable;
16
The answer above is quite well explained. I am posting a very simple implementation of the UDF.
package com.ak.hive.udf.test;

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class ArraySumUDF extends UDF {

    // startIndex and endIndex are 1-based and both inclusive.
    public int evaluate(ArrayList<Integer> arrayOfIntegers, int startIndex, int endIndex) {
        // TODO: add null and bounds checks for the array and the indexes
        int sum = 0;
        for (int count = startIndex - 1; count < endIndex; count++) {
            sum += arrayOfIntegers.get(count);
        }
        return sum;
    }
}
Also posting the table creation and other queries.
create table table1 (col1 int,col2 array<int>)ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '~' STORED AS TEXTFILE;
load data local inpath '/home/ak/Desktop/hivedata' into table table1;
My input file looks like this:
1,3~5~8~5~7~9
2,93~5~8~5~7~29
3,3~95~8~5~27~9
4,3~5~58~15~7~9
5,3~25~8~55~7~49
6,3~25~8~15~7~19
7,3~55~78~5~7~9
I created a jar of my UDF and added it to Hive using the following command:
add jar file:///home/ak/Desktop/array.jar;
Then I create a temporary function as shown:
create temporary function getSum as 'com.ak.hive.udf.test.ArraySumUDF';
Perform a sample query as below:
select col1,getSum(col2,1,3) from table1;
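The UDF above leaves index handling as a TODO, so here is a rough plain-Java sketch of one way to do it with the same 1-based, inclusive convention. It returns null for an invalid range or a NULL element, on the assumption that a UDF returning null would show up as NULL in the query result. The class name `OneBasedRangeSum` is hypothetical, and the sketch has no Hive dependency so it can be run standalone against the first two sample rows.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Hive): 1-based, inclusive range sum with validation.
final class OneBasedRangeSum {

    private OneBasedRangeSum() {
    }

    // Returns the sum of elements startIndex..endIndex (1-based, both inclusive),
    // or null when the array is missing, the range is invalid, or an element is null.
    static Integer sum(List<Integer> values, int startIndex, int endIndex) {
        if (values == null || startIndex < 1
                || endIndex > values.size() || startIndex > endIndex) {
            return null;
        }
        int total = 0;
        for (int i = startIndex - 1; i < endIndex; i++) {
            Integer value = values.get(i);
            if (value == null) {
                return null; // assumption: a NULL element makes the whole sum NULL
            }
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        // First two rows of the sample input, elements 1 through 3.
        System.out.println(sum(Arrays.asList(3, 5, 8, 5, 7, 9), 1, 3));   // 3 + 5 + 8
        System.out.println(sum(Arrays.asList(93, 5, 8, 5, 7, 29), 1, 3)); // 93 + 5 + 8
        System.out.println(sum(Arrays.asList(3, 5, 8, 5, 7, 9), 4, 9));   // range past the end
    }
}
```

For those two rows the sums work out to 16 and 106, which is what getSum(col2,1,3) should produce for col1 = 1 and col1 = 2.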
This should cover the basic need. If this is not what you were asking for, please reply so that I can help further.