The plan was to get data from AWS Data Exchange, move it to an S3 bucket, then query it with AWS Athena for a data API. Everything works, it just feels a bit slow. No matter the dataset or the query, I can't get below 2 seconds in Athena response time, which is a lot for an API. I checked the best practices, but it seems those are also above 2 seconds. So my question: is 2 seconds the minimal response time for Athena? If so, I have to switch to Postgres.


AWS Athena too slow for an api?

1 Answer

Athena is indeed not a low-latency data store. You will very rarely see response times below one second, and often they will be considerably longer. In the general case Athena is not suitable as a backend for an API, but of course that depends on what kind of API it is. If it's some kind of analytics service, perhaps users don't expect sub-second response times? I have built APIs on Athena that work really well, but those were services where response times in seconds were expected (and even considered fast), and I got help from the Athena team to tune our account to our workload.

To understand why Athena is "slow", we can dissect what happens when you submit a query to Athena:

1. Your code starts a query by using the StartQueryExecution API call
2. The Athena service receives the query, and puts it on a queue. If you're unlucky your query will sit in the queue for a while
3. When there is available capacity the Athena service takes your query from the queue and makes a query plan
4. The query plan requires loading table metadata from the Glue catalog, including the list of partitions, for all tables included in the query
5. Athena also lists all the locations on S3 it got from the tables and partitions to produce a full list of files that will be processed
6. The plan is then executed in parallel, and depending on its complexity, in multiple steps
7. The results of the parallel executions are combined and a result is serialized as CSV and written to S3
8. Meanwhile your code checks if the query has completed using the GetQueryExecution API call, until it gets a response that says that the execution has succeeded, failed, or been cancelled
9. If the execution succeeded your code uses the GetQueryResults API call to retrieve the first page of results
10. To respond to that API call, Athena reads the result CSV from S3, deserializes it, and serializes it as JSON for the API response
11. If there are more than 1000 rows the last steps will be repeated

A Presto expert could probably give more detail about steps 4-6, even though they are probably a bit modified in Athena's version of Presto. The details aren't very important for this discussion though.

If you run a query over a lot of data, tens of gigabytes or more, the total execution time will be dominated by step 6. If the result is also big, step 7 will be a factor.

If your data set is small, and/or involves thousands of files on S3, then steps 4-5 will dominate instead.
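The client-side part of the steps above (start, poll, fetch the first page) can be sketched roughly as follows. This is a minimal sketch, not a production client: it assumes `athena` is a boto3 Athena client (for example `boto3.client("athena")`), and the function name `run_query`, the 100 ms default poll interval, and the output location are illustrative assumptions.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_query(athena, sql, output_location, poll_interval=0.1):
    """Start a query, poll until it finishes, and return the first page of rows.

    `athena` is assumed to be a boto3 Athena client; `output_location` is an
    s3:// URI you own (hypothetical in this sketch).
    """
    # Submit the query. Athena queues it and returns an ID right away.
    query_id = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the execution reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_interval)

    if state != "SUCCEEDED":
        raise RuntimeError(f"query {query_id} ended in state {state}")

    # Fetch the first page of results (up to 1000 rows, header row included).
    page = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in page["ResultSet"]["Rows"]
    ]
```

Every line in the loop is a network round trip or a sleep, which is why even a trivial query cannot return instantly through this interface.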

Here are some reasons why Athena queries can never be fast, even if they wouldn't touch S3 (for example SELECT NOW()):

- There will be at least three API calls before you get the response: a StartQueryExecution, a GetQueryExecution, and a GetQueryResults. Their round trip times (RTT) alone would add up to more than 100 ms.
- You will most likely have to call GetQueryExecution multiple times, and the delay between calls puts a bound on how quickly you can discover that the query has succeeded. For example, if you call it every 100 ms you will on average add half of (100 ms + RTT) to the total time, because on average you'll miss the actual completion time by that much.
- Athena writes the results to S3 before it marks the execution as succeeded, and since it produces a single CSV file this is not done in parallel. A big response takes time to write.
- GetQueryResults must read the CSV from S3, parse it, and serialize it as JSON. Subsequent pages must skip ahead in the CSV, and may be even slower.
- Athena is a multi-tenant service: all customers compete for resources, and your queries will be queued when there aren't enough resources available.
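To put rough numbers on the first two points, here is a back-of-the-envelope model of the client-side overhead alone. It is a sketch under stated assumptions: the 40 ms RTT is a made-up figure, the function names are illustrative, and the model ignores queueing, execution, and the S3 write entirely.

```python
def polling_overhead_ms(poll_interval_ms, rtt_ms):
    """Expected delay between the query finishing and the poller noticing.

    One poll cycle is the sleep plus one GetQueryExecution round trip. If the
    query completes at a uniformly random point in that cycle, on average half
    a cycle passes before the client sees it.
    """
    return (poll_interval_ms + rtt_ms) / 2


def latency_floor_ms(rtt_ms, poll_interval_ms):
    """Rough lower bound on client-observed latency from API overhead alone."""
    # StartQueryExecution + one GetQueryExecution + GetQueryResults.
    api_calls_ms = 3 * rtt_ms
    return api_calls_ms + polling_overhead_ms(poll_interval_ms, rtt_ms)


# Assumed values: 40 ms per round trip, polling every 100 ms.
print(latency_floor_ms(rtt_ms=40, poll_interval_ms=100))  # 190.0
```

So under these assumptions roughly 190 ms is spent before Athena has done any work at all, and the real service-side steps come on top of that.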

If you want to know what affects the performance of your queries you can use the ListQueryExecutions API call to list recent query execution IDs (Athena retains query history for 45 days), and then use GetQueryExecution to get query statistics (see the documentation for QueryExecution.Statistics for what each property means). With this information you can figure out whether your slow queries are caused by queueing, execution, or the overhead of making the API calls (if it's not the first two, it's likely the last).
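A sketch of that analysis, again assuming `athena` is a boto3 Athena client. The statistics field names are the ones in the `QueryExecutionStatistics` structure; the helper names `summarize` and `recent_query_breakdown` are made up for illustration, and only the first page of the listing is read.

```python
def summarize(stats):
    """Reduce one QueryExecution.Statistics dict to its main time components."""
    phases = {
        "queueing": stats.get("QueryQueueTimeInMillis", 0),
        # Engine time includes query planning (QueryPlanningTimeInMillis).
        "execution": stats.get("EngineExecutionTimeInMillis", 0),
        "service processing": stats.get("ServiceProcessingTimeInMillis", 0),
    }
    return {
        "total_ms": stats.get("TotalExecutionTimeInMillis", 0),
        "phases_ms": phases,
        "dominant": max(phases, key=phases.get),
    }


def recent_query_breakdown(athena, max_results=50):
    """Map recent query IDs to a per-phase time summary (first page only)."""
    listing = athena.list_query_executions(MaxResults=max_results)
    report = {}
    for query_id in listing["QueryExecutionIds"]:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        stats = execution["QueryExecution"].get("Statistics", {})
        report[query_id] = summarize(stats)
    return report
```

If you also record the wall-clock time your own code measured for a query, the difference between that and `total_ms` is an estimate of the API-call and polling overhead on your side.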

There are some things you can do to cut some of the delays, but these tips are unlikely to get you down to sub second latencies:

- If you query a lot of data, use file formats that are optimized for that kind of thing. Parquet is almost always the answer. Also make sure your file sizes are optimal, around 100 MB.
- Avoid lots of files, and avoid deep hierarchies. Ideally have just one or a few files per partition, and don't organize files in "subdirectories" (S3 prefixes with slashes) except for those corresponding to partitions.
- Avoid running queries at the top of the hour. This is when everyone else's scheduled jobs run, and there's significant contention for resources during the first minutes of every hour.
- Skip GetQueryResults and download the CSV from S3 directly. The GetQueryResults call is convenient if you want to know the data types of the columns, but if you already know, or don't care, reading the data directly can save you some precious tens of milliseconds. If you need the column data types you can get the ….csv.metadata file that is written alongside the result CSV. It's undocumented Protobuf data, see here and here for more information.
- Ask the Athena service team to tune your account. This might not be something you can get without higher tiers of support. I don't really know the politics of this, and you need to start by talking to your account manager.
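Reading the result CSV straight from S3 could look something like the sketch below. It assumes `athena` and `s3` are boto3 clients for the respective services and that the query has already succeeded; the function name `read_result_csv` is illustrative. One `GetQueryExecution` call is still used here to look up where the result file was written.

```python
import csv
import io
from urllib.parse import urlparse


def read_result_csv(athena, s3, query_id):
    """Return (header, rows) for a finished query by reading its CSV from S3."""
    execution = athena.get_query_execution(QueryExecutionId=query_id)
    location = execution["QueryExecution"]["ResultConfiguration"]["OutputLocation"]

    # OutputLocation has the form s3://bucket/prefix/<query id>.csv
    parsed = urlparse(location)
    bucket, key = parsed.netloc, parsed.path.lstrip("/")

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    reader = csv.reader(io.StringIO(body.decode("utf-8")))

    # The first line of the result file is the header row.
    header = next(reader, [])
    return header, list(reader)
```

All values come back as strings, so any type conversion is up to the caller, which is the trade-off for skipping the API's column metadata.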

answered Oct 05 '22 05:10

Theo


Tags:

amazon-web-services

amazon-athena

athomas
