Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AWS Athena ALIAS in Group By does not get resolved

I have a very basic group by query in Athena where I would like to use an alias. One can make the example work by putting the same reference in the group by, but that's not really handy when there's complex column modifications going on and logic needs to be copied in two places. Also I did that in the past and now I have a statement that doesn't work by copying over.

Problem:

SELECT 
    substr(accountDescriptor, 5) as account, 
    sum(revenue) as grossRevenue 
FROM sales 
GROUP BY account

This will throw an error:

alias Column 'account' cannot be resolved

The following works, so it's about the alias handling.

SELECT 
    substr(accountDescriptor, 5) as account, 
    sum(revenue) as grossRevenue 
FROM sales 
GROUP BY substr(accountDescriptor, 5)
like image 207
supernova Avatar asked Feb 10 '20 00:02

supernova


People also ask

Why do I get zero records when I query my Amazon Athena table?

Incorrect LOCATION path If the input LOCATION path is incorrect, then Athena returns zero records.

Does Athena cache query results?

No, Athena doesn't support query caching.

Why is AWS Athena so slow?

Athena Performance Issues Unlike full database products, it does not have its own optimized storage layer. Therefore its performance is strongly dependent on how data is organized in S3—if data is sorted to allow efficient metadata based filtering, it will perform fast, and if not, some queries may be very slow.


Video Answer


2 Answers

That is because SQL is evaluated in certain order, like table scan, filter, aggregation, projection, sort. You tried to use the result of projection as input of aggregation. In many cases it could be possible (where projection is trivial, like your case), but it such behaviour is not defined in ANSI SQL (which Presto and so Athena follows).

We see that in many cases it is very useful so, support for this might be added in future (extending ANSI SQL).

Currently, there are several ways to overcome this:

SELECT account, sum(revenue) as grossRevenue 
FROM (SELECT substr(accountDescriptor, 5) as account, revenue FROM sales)
GROUP BY account

or

WITH better_sales AS (SELECT substr(accountDescriptor, 5) as account, revenue FROM sales)
SELECT account, sum(revenue) as grossRevenue 
FROM better_sales
GROUP BY account

or

SELECT account, sum(revenue) as grossRevenue 
FROM sales
LATERAL JOIN (SELECT substr(accountDescriptor, 5) as account)
GROUP BY account

or

SELECT substr(accountDescriptor, 5) as account, sum(revenue) as grossRevenue
FROM sales
GROUP BY 1;
like image 118
kokosing Avatar answered Nov 05 '22 22:11

kokosing


In addition to answers from kokosing and Gordon Linoff, you can use numbers that represent the location of the grouped column name in the SELECT statement. Such approach can also provide you with better performance as described in section 8 of this AWS Blog. For example:

SELECT
    substr(accountDescriptor, 5) as account,
    sum(revenue) as grossRevenue
FROM sales,
GROUP BY 1

Note: numbering starts from one and not from zero.

Here 1 is somewhat aliased to account. The main obvious downside is that if you change ordering of you columns within SELECT than you would also need to account for that within GROUP BY:

SELECT
    sum(revenue) as grossRevenue,
    substr(accountDescriptor, 5) as account
FROM sales,
GROUP BY 2
like image 43
Ilya Kisil Avatar answered Nov 05 '22 23:11

Ilya Kisil