AWS Athena partition fetch all paths

Recently, I've run into an issue with AWS Athena when there is a fairly high number of partitions.

The old setup had a database and tables with only one partition level, say id=x. Take one table as an example: it stores payment parameters per id (product), and there aren't that many IDs, roughly 1000-5000. When querying that table with the id in the where clause, like ".. where id = 10", the queries returned pretty fast. Assume we update the data twice a day.

Lately, we've been thinking of adding a second partition level for the day, like "../id=x/dt=yyyy-mm-dd/..". This means the number of partitions grows by the number of IDs every day; with 3000 IDs we'd get roughly 3000x30=90000 new partitions per month. In other words, a rapid growth in the number of partitions.
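
For illustration, here is a minimal sketch of what such a two-level layout might look like as Athena DDL; the table name, columns, and bucket are hypothetical, since the actual schema isn't shown in this question:

-- Hypothetical table with the two-level partition layout described above
CREATE EXTERNAL TABLE db.payments (
  amount   double,
  currency string
)
PARTITIONED BY (id int, dt string)
STORED AS PARQUET
LOCATION 's3://example-bucket/payments/';

-- Every (id, dt) pair becomes its own partition entry and S3 prefix,
-- e.g. s3://example-bucket/payments/id=10/dt=2019-12-01/
ALTER TABLE db.payments ADD IF NOT EXISTS
  PARTITION (id = 10, dt = '2019-12-01')
  LOCATION 's3://example-bucket/payments/id=10/dt=2019-12-01/';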

On, say, 3 months of data (~270k partitions), we'd like a query like the following to return in at most 20 seconds or so.

select count(*) from db.table where id = x and dt = 'yyyy-mm-dd'

This takes like a minute.

The Real Case

It turns out Athena first fetches all partitions (metadata) and their S3 paths, regardless of the where clause, and only then filters down to the S3 paths that match the where condition. The first part (fetching all S3 paths by partition) takes time proportional to the number of partitions.

The more partitions you have, the slower the query executes.

Intuitively, I expected Athena to fetch only the S3 paths matching the where clause; that would be the whole point of the partitioning. Maybe it really does fetch all paths.
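
As a rough sanity check of how much partition metadata is involved, you can enumerate the registered partitions (db.table being the placeholder name from the query above):

-- Lists every partition registered for the table; with ~270k partitions
-- the enumeration alone can already take a noticeable amount of time
SHOW PARTITIONS db.table;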

  • Does anybody know a workaround, or are we using Athena the wrong way?
  • Should Athena be used only with a small number of partitions?

Edit

To clarify the statement above, here is an excerpt from the support mail.

from Support

... You mentioned that your new system has 360000 which is a huge number. So when you are doing select * from <partitioned table>, Athena first download all partition metadata and searched S3 path mapped with those partitions. This process of fetching data for each partition lead to longer time in query execution. ...

Update

I have opened an issue on the AWS forums; the linked issue is here.

Thanks.



1 Answer

This is impossible to properly answer without knowing the amount of data, what file formats, and how many files we're talking about.

TL;DR: I suspect you have partitions with thousands of files and that the bottleneck is listing and reading them all.

For any data set that grows over time you should have temporal partitioning, on date or even time, depending on query patterns. Whether you should also partition on other properties depends on a lot of factors, and in the end it often turns out that not partitioning is better. Not always, but often.
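
As a concrete (and purely hypothetical) contrast to the id=x/dt=... layout from the question, a date-only scheme would keep id as a regular column:

-- Temporal-only partitioning: one partition per day, id stays a regular column
CREATE EXTERNAL TABLE db.payments_by_day (
  id       int,
  amount   double,
  currency string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://example-bucket/payments-by-day/';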

Using reasonably sized (~100 MB) Parquet files can in many cases be more effective than partitioning. The reason is that partitioning increases the number of prefixes that have to be listed on S3, and the number of files that have to be read. A single 100 MB Parquet file can be more efficient than ten 10 MB files in many cases.
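
One way to end up with fewer, larger Parquet files is a CTAS statement that rewrites the data. This is only a sketch, reusing the hypothetical column names and S3 location from the earlier sketch rather than anything stated in the question:

-- Rewrite the data as Parquet, bucketing by id to cap the number of
-- output files per dt partition (all names and the location are assumed)
CREATE TABLE db.table_compacted
WITH (
  format = 'PARQUET',
  external_location = 's3://example-bucket/payments-compacted/',
  partitioned_by = ARRAY['dt'],
  bucketed_by = ARRAY['id'],
  bucket_count = 10
) AS
SELECT amount, currency, id, dt
FROM db.table;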

When Athena executes a query it will first load partitions from Glue. Glue supports limited filtering on partitions, and will help a bit in pruning the list of partitions – so to the best of my knowledge it's not true that Athena reads all partition metadata.

When it has the partitions it will issue LIST operations to the partition locations to gather the files that are involved in the query – in other words, Athena won't list every partition location, just the ones in partitions selected for the query. This may still be a large number, and these list operations are definitely a bottleneck. It becomes especially bad if there are more than 1000 files in a partition, because that's the page size of S3's list operations, and multiple requests will have to be made sequentially.

With all files listed Athena will generate a list of splits, which may or may not equal the list of files – some file formats are splittable, and if files are big enough they are split and processed in parallel.

Only after all of that work is done does the actual query processing start. Depending on the total number of splits and the amount of available capacity in the Athena cluster, your query will be allocated resources and start executing.

If your data was in Parquet format, and there was one or a few files per partition, the count query in your question should run in a second or less. Parquet has enough metadata in the files that a count query doesn't have to read the data, just the file footer. It's hard to get any query to run in less than a second due to the multiple steps involved, but a query hitting a single partition should run quickly.

Since it takes two minutes I suspect you have hundreds of files per partition, if not thousands, and your bottleneck is that it takes too much time to run all the list and get operations in S3.
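
To check whether that is the case, one option is to count the distinct S3 objects behind a single partition using Athena's "$path" pseudo-column (the id and dt values below are placeholders):

-- Counts how many distinct S3 objects back one partition; hundreds or
-- thousands of small files here would point to the listing bottleneck
SELECT count(DISTINCT "$path") AS file_count
FROM db.table
WHERE id = 10
  AND dt = '2019-12-01';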

Answered by Theo