I have a UDF (GetUrlExt) that returns extension. (ex: jpg in /abc/models/xyz/images/top.jpg). The data is like shown below:
Date Time TimeTaken uristem
9/5/2011 0:00:10 234 /abc/models/xyz/images/top.jpg
9/5/2011 0:00:11 456 /abc/models/xyz/images/bottom.jpg
9/5/2011 0:00:14 789 /abc/models/xyz/images/left.gif
9/5/2011 0:00:16 234 /abc/models/xyz/images/top.pdf
9/5/2011 0:00:18 734 /abc/models/xyz/images/top.pdf
9/5/2011 0:00:19 654 /abc/models/xyz/images/right.gif
9/5/2011 0:00:21 346 /abc/models/xyz/images/top.pdf
9/5/2011 0:00:24 556 /abc/models/xyz/images/front.pdf
9/5/2011 0:00:26 134 /abc/models/xyz/images/back.jpg
The query without 'GROUP BY' is working fine:
SELECT GetUrlExt(uristem) AS extn FROM LogTable;
Result: jpg jpg gif pdf pdf gif pdf pdf jpg
Now I need 'GROUP BY' on the results of the GetUrlExt UDF.
Expected Result:
jpg 3 274.6
gif 2 721.5
pdf 4 467.5
But the following query is not working:
SELECT GetUrlExt(uristem) AS extn, Count(*) AS PerCount, Avg(TimeTaken) AS AvgTime FROM LogTable GROUP BY extn;
Any kind of help is appreciated!
User Defined Functions, also known as UDF, allow you to create custom functions to process records or groups of records. Hive comes with a comprehensive library of functions. There are however some omissions, and some specific cases for which UDFs are the solution.
In hive you can overload the method same as JAVA. But in UDF you have to use Hadoop Datatypes likes IntWritable, FloatWritable...
1) UDF Operates on a single row and produces a single row as its output has most of the functions, such as mathematical functions. 2) UDAF'S:- UDAF works on multiple input rows and creates a single output row and aggregate functions which include functions such as count and MAX.
Spark SQL supports integration of Hive UDFs, UDAFs and UDTFs. Similar to Spark UDFs and UDAFs, Hive UDFs work on a single row as input and generate a single row as output, while Hive UDAFs operate on multiple rows and return a single aggregated row as a result.
Pls use subquery to group by.
Hive doesn't support group by calculated value directly.
SELECT a.extn, Count(*) AS PerCount, Avg(TimeTaken) AS AvgTime
FROM
(
SELECT GetUrlExt(uristem) AS extn, TimeTaken
FROM LogTable
) a
GROUP BY a.extn;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With