I try to write Hive Sql like that
SELECT count(1), substr(date, 1, 4) as year
FROM ***
GROUP BY year
But Hive cannot recognize the alias name 'year', it complains that: FAILED: SemanticException [Error 10004]: Line 1:79 Invalid table alias or column reference 'year'
One solution(Hive: SELECT AS and GROUP BY) suggest to use 'GROUP BY substr(date, 1, 4)'.
It works! However in some cases the value I want to group by may be generated from multiple lines of hive function code, it's very ugly to write code like
SELECT count(1), func1(func2(..........................)) AS something
FROM ***
GROUP BY func1(func2(..........................))
Is there any clean way in Hive to do that? Any suggestions?
Hive's massive parallelism eliminates many of the disk-I/O limitations of an RDBMS, reducing the value of normalization for reducing data volume. Hive is often used with data volumes for which it would impractical to use normalized data.
Group by query:Group by clause use columns on Hive tables for grouping particular column values mentioned with the group by. For whatever the column name we are defining a “groupby” clause the query will selects and display results by grouping the particular column values.
Solution. Order the records first and then apply the LIMIT clause to limit the number of records.
In Hive 0.11.0 and later, columns can be specified by position if hive.groupby.orderby.position.alias is set to true (the default is false).
So setting set hive.groupby.orderby.position.alias=true;
in your .hql (or .hiverc for a permanent solution) will do the trick and then you can type group by 2
for the above example.
Source: hive language manual
Specifying the position in Group By will solve your issue. This position number in Group By works even when SET hive.groupby.orderby.position.alias=false; (Hive 0.12)
SELECT count(1), substr(date, 1, 4) as year
FROM ***
GROUP BY 2;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With