I am new to Hive and Spark.
Consider the following SQL query:
select col1, col2, min(col3), first(col4) from tablename group by col1, col2
Since I don't want to include col4 in the group by, I have used first(col4) (but I still want col4 to be displayed).
I want to write the same query in Hive, but Hive has no first() function.
Ref : https://docs.treasuredata.com/articles/hive-aggregate-functions
I want to write the same query in Spark SQL (Using dataframes).
Similarly, Spark's aggregate functions have no first function. (The available aggregate methods are avg, max, min, sum, count.)
Ref : org.apache.spark.sql.GroupedData
E.g. :
val df1 = sqlContext.sql(" select * from TABLENAME").groupBy("COL1","COL2").agg("COL3" -> "MIN", "COL4" -> "????")
Is there any way to write the same query in Hive and Spark?
If you don't want to include col4 in the group by but still want to select it alongside min(), you need analytic (window) functions:
select col1,
col2,
min(col3) over (partition by col1, col2) as min_col3,
col4
from tablename;
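The same analytic form can be sketched with DataFrames (assuming Spark 1.4+, where Window support was added; the table and column names are taken from the question):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.min
import sqlContext.implicits._

// min(col3) over (partition by col1, col2), keeping col4 on every row
val w = Window.partitionBy("col1", "col2")

val result = sqlContext.table("tablename")
  .select($"col1", $"col2",
          min($"col3").over(w).as("min_col3"),
          $"col4")
```

Note that, like the SQL above, this keeps every input row rather than collapsing each group to one row.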
See docs here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Alternatively, in Hive you can take the min() of a struct, so that col4 is carried along with the minimum col3:
select col1
,col2
,s34.col3
,s34.col4
from (select col1
,col2
,min(named_struct('col3',col3,'col4',col4)) as s34
from tablename
group by col1
,col2
) t
;
or
select col1
,col2
,s34.col1 as col3
,s34.col2 as col4
from (select col1
,col2
,min(struct(col3,col4)) as s34
from tablename
group by col1
,col2
) t
;
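A sketch of the same struct trick with DataFrames (assuming a Spark version whose min() supports struct ordering, i.e. comparison by the first field; names are from the question):

```scala
import org.apache.spark.sql.functions.{min, struct}

// min() over a struct orders by the first field (col3),
// so col4 rides along with the minimum col3 in each group.
val agged = sqlContext.table("tablename")
  .groupBy("col1", "col2")
  .agg(min(struct("col3", "col4")).as("s34"))
  .select("col1", "col2", "s34.col3", "s34.col4")
```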
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF