
Reduce the amount of data scanned by Athena when using aggregate functions

The query below scans 100 MB of data.

select * from table where column1 = 'val' and partition_id = '20190309';

However, the query below scans 15 GB of data (there are over 90 partitions).

select * from table where column1 = 'val' and partition_id in (select max(partition_id) from table);

How can I optimize the second query to scan the same amount of data as the first?

Punter Vicky asked Dec 10 '22 03:12



1 Answer

There are two problems here: the efficiency of the scalar subquery above, select max(partition_id) from table, and the one @PiotrFindeisen pointed out around dynamic filtering.

The first problem is that queries over the partition keys of a Hive table are a lot more complex than they appear. Most folks would think that if you want the max value of a partition key, you can simply execute a query over the partition keys, but that doesn't work because Hive allows partitions to be empty (and it also allows non-empty files that contain no rows). Specifically, the scalar subquery above, select max(partition_id) from table, requires Trino (formerly PrestoSQL) to find the max partition containing at least one row. The ideal solution would be to have perfect stats in Hive, but short of that, the engine would need custom logic for Hive that opens files in the partitions until it finds a non-empty one.
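To make the distinction concrete, here is a toy Python model (not engine code). The partition keys and row counts are made-up values for illustration; the point is only that the largest registered partition and the largest partition that actually holds rows can differ.

```python
# Toy model: partition key -> number of rows stored in that partition.
# The values are invented; Hive allows a registered partition to hold no rows.
partitions = {"20190307": 1200, "20190308": 950, "20190309": 0}

# Answer available from partition metadata alone: the largest registered key.
max_registered = max(partitions)

# Answer the scalar subquery over the table must give: the largest key
# among partitions that contain at least one row.
max_with_rows = max(key for key, rows in partitions.items() if rows > 0)

print(max_registered)  # 20190309
print(max_with_rows)   # 20190308
```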

If you are sure that your warehouse does not contain empty partitions (or if you are OK with the implications of that), you can replace the scalar subquery with one over the hidden $partitions table:

select * 
from table 
where column1 = 'val' and 
    partition_id = (select max(partition_id) from "table$partitions");

The second problem is the one @PiotrFindeisen pointed out, and it has to do with the way queries are planned and executed. Most people would look at the above query and conclude that the engine should obviously figure out the value of select max(partition_id) from "table$partitions" during planning, inline that into the plan, and then continue with optimization. Unfortunately, that is a pretty complex decision to make generically, so the engine instead simply models this as a broadcast join, where one part of the execution figures out that value and broadcasts it to the rest of the workers. The problem is that the rest of the execution has no way to add this new information into the existing processing, so it simply scans all of the data and then filters out the values you are trying to skip. There is a project in progress to add this dynamic filtering, but it is not complete yet.

This means the best you can do today is to run two separate queries: one to get the max partition_id and a second one with the inlined value.
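A minimal Python sketch of that two-step pattern follows. It is not tied to any particular client library: `run_query` is a hypothetical callable that takes SQL text and returns rows, and the helper names are made up for illustration. It also assumes partition_id values are all digits, like '20190309', and that the table name comes from trusted code rather than user input.

```python
from typing import Callable, List, Sequence

# Hypothetical query runner: takes SQL text and returns rows as sequences.
# In practice this would wrap whatever Athena or Trino client you use.
QueryRunner = Callable[[str], List[Sequence]]


def latest_partition_query(table: str) -> str:
    """Build SQL that reads only the hidden $partitions table."""
    return f'select max(partition_id) from "{table}$partitions"'


def filtered_query(table: str, value: str, partition_id: str) -> str:
    """Build SQL with the partition value inlined as a literal."""
    # Assumption: partition ids are all digits (e.g. '20190309').
    if not partition_id.isdigit():
        raise ValueError(f"unexpected partition_id: {partition_id!r}")
    escaped = value.replace("'", "''")  # escape single quotes in the literal
    return (
        f"select * from {table} "
        f"where column1 = '{escaped}' and partition_id = '{partition_id}'"
    )


def fetch_latest(run_query: QueryRunner, table: str, value: str) -> List[Sequence]:
    """Run the two queries in sequence: find the max partition, then filter on it."""
    rows = run_query(latest_partition_query(table))
    partition_id = rows[0][0] if rows else None
    if partition_id is None:  # max() over zero partitions yields NULL
        return []
    return run_query(filtered_query(table, value, partition_id))
```

Because the second statement carries the partition value as a plain literal, it has the same shape as the first query in the question, which is the one that scanned only a single partition.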

BTW, the hidden "$partitions" table was added in Presto 0.199, and we fixed some minor bugs in 0.201. I'm not sure which version Athena is based on, but I believe it is pretty far out of date (the current release at the time I'm writing this answer is 309).

Dain Sundstrom answered Jan 02 '23 12:01
