Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AWS Athena concurrency limits: Number of submitted queries VS number of running queries

According to AWS Athena limitations you can submit up to 20 queries of the same type at a time, but it is a soft limit and can be increased on request. I use boto3 to interact with Athena and my script submits 16 CTAS queries each of which takes about 2 minutes to finish. In a AWS account, it is only me who is using Athena service. However, when I look at the state of queries through console I see that only a few of queries (5 on average) are actually being executed despite all of them being in state Running. Here is what would normally see in Athena hisotry tab:

Athena hisotry tab

I understand that, after I submit queries to Athena, it processes the queries by assigning resources based on the overall service load and the amount of incoming requests. But I tried to run them at different days and hours, still would get about 5 queries being executed at the same time.

So my question is this how it supposed to be? If it is then what is the point of being able to submit up to 20 queries if roughly 15 of them would be idling and waiting for available slots.

Update 2019-09-26

Just stumbled across HIVE CONNECTOR in presto documentation, which has a section AWS Glue Catalog Configuration Properties. There we can see

hive.metastore.glue.max-connections: Max number of concurrent connections to Glue (defaults to 5).

This got me wonder if it has something to do with my issue. As I understand, Athena is simply a Presto that runs on EMR cluster which is configured to use AWS Glue Data Catalog as the Metastore.

So what if my issue comes from the fact that EMR cluster for Athena simply uses default value for concurrent connections to Glue, which is 5 which and is exactly of how many concurrent queries are actually getting executed (on average) in my case.

Update 2019-11-27

The Athena team recently deployed a host of new functionality for Athena. although QUEUED has been in the state enum for some time is hasn't been used until now. So now I get, correct info about query state in a history tab, but everything else remains the same.

enter image description here

Also, another post was published with similar problem.

like image 821
Ilya Kisil Avatar asked Jul 22 '19 12:07

Ilya Kisil


People also ask

What are the limitations of Athena?

AWS Athena partition limits Athena's users can use AWS Glue, a data catalog and ETL service. Athena's partition limit is 20,000 per table and Glue's limit is 1,000,000 partitions per table. A Create Table As (CTAS) or INSERT INTO query can only create up to 100 partitions in a destination table.

Can we run multiple queries in Athena?

Open the Amazon Athena console at https://console.aws.amazon.com/athena/ . In the left navigation pane, choose Workflows. In the Execute multiple queries tile, choose Get started. In the Get started dialog box, choose Deploy a sample project, and then choose Continue.


1 Answers

Your account's limits for the Athena service is not an SLA, it's more of a priority in the query scheduler.

Depending on available capacity your queries may be queued even though you're not running any other queries. Exactly what a higher concurrency limit means is internal and could change, but in my experience it's best to think of it as the priority by which he query scheduler will deal with your query. Queries for all accounts run in the same server pool(s) and if everyone is running queries there will not be any capacity left for you.

You can see this in action by running the same query over and over again and then plot the query execution metrics over time, you will notice that they vary a lot, and you will notice spikes in the time your queries are queued on the top of every hour – when everyone else is running their scheduled queries.

like image 53
Theo Avatar answered Sep 22 '22 20:09

Theo