Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hive: UDF and GROUP BY

Tags:

hive

hiveql

I have a UDF (GetUrlExt) that returns extension. (ex: jpg in /abc/models/xyz/images/top.jpg). The data is like shown below:

Date Time TimeTaken uristem  
9/5/2011 0:00:10 234 /abc/models/xyz/images/top.jpg  
9/5/2011 0:00:11 456 /abc/models/xyz/images/bottom.jpg  
9/5/2011 0:00:14 789 /abc/models/xyz/images/left.gif  
9/5/2011 0:00:16 234 /abc/models/xyz/images/top.pdf  
9/5/2011 0:00:18 734 /abc/models/xyz/images/top.pdf  
9/5/2011 0:00:19 654 /abc/models/xyz/images/right.gif  
9/5/2011 0:00:21 346 /abc/models/xyz/images/top.pdf  
9/5/2011 0:00:24 556 /abc/models/xyz/images/front.pdf  
9/5/2011 0:00:26 134 /abc/models/xyz/images/back.jpg

The query without 'GROUP BY' is working fine:

SELECT GetUrlExt(uristem) AS extn FROM LogTable; 

Result: jpg jpg gif pdf pdf gif pdf pdf jpg

Now I need 'GROUP BY' on the results of the GetUrlExt UDF.
Expected Result:
jpg 3 274.6
gif 2 721.5
pdf 4 467.5

But the following query is not working:

SELECT GetUrlExt(uristem) AS extn, Count(*) AS PerCount, Avg(TimeTaken) AS AvgTime FROM LogTable GROUP BY extn;

Any kind of help is appreciated!

like image 424
Srinivas Avatar asked Nov 20 '12 09:11

Srinivas


People also ask

What is UDF in Hive?

User Defined Functions, also known as UDF, allow you to create custom functions to process records or groups of records. Hive comes with a comprehensive library of functions. There are however some omissions, and some specific cases for which UDFs are the solution.

Which method has to be overridden when we use custom UDF in Hive?

In hive you can overload the method same as JAVA. But in UDF you have to use Hadoop Datatypes likes IntWritable, FloatWritable...

What is UDF and UDAF?

1) UDF Operates on a single row and produces a single row as its output has most of the functions, such as mathematical functions. 2) UDAF'S:- UDAF works on multiple input rows and creates a single output row and aggregate functions which include functions such as count and MAX.

Can we use Hive UDF in spark?

Spark SQL supports integration of Hive UDFs, UDAFs and UDTFs. Similar to Spark UDFs and UDAFs, Hive UDFs work on a single row as input and generate a single row as output, while Hive UDAFs operate on multiple rows and return a single aggregated row as a result.


1 Answers

Pls use subquery to group by.

Hive doesn't support group by calculated value directly.

SELECT a.extn, Count(*) AS PerCount, Avg(TimeTaken) AS AvgTime 
FROM
(
    SELECT GetUrlExt(uristem) AS extn, TimeTaken
    FROM LogTable 
) a
GROUP BY a.extn;
like image 186
pensz Avatar answered Sep 24 '22 05:09

pensz