AWS Athena too slow for an API?

The plan was to get data from AWS Data Exchange, move it to an S3 bucket, then query it with AWS Athena for a data API. Everything works, it just feels a bit slow.

No matter the dataset or the query, I can't get below 2 seconds of Athena response time, which is a lot for an API. I checked the best practices, but it seems those are also above 2 seconds.

So my question: is 2 seconds the minimum response time for Athena?

If so, then I have to switch to Postgres.

asked Apr 24 '20 by athomas

People also ask

Why is AWS Athena so slow?

Unlike full database products, Athena does not have its own optimized storage layer. Its performance therefore depends strongly on how the data is organized in S3: if the data is laid out to allow efficient metadata-based filtering, queries will be fast; if not, some queries may be very slow.

Why do Athena queries take so long?

Your queries might be temporarily queued before they run. Queries generally take a long time because of either high queuing time or high engine processing time.

How do you make Athena queries faster?

Compressing your data can speed up your queries significantly, as long as the files are either of an optimal size (see the next section) or splittable. Smaller data sizes reduce the amount of data scanned from Amazon S3, which also lowers the cost of running queries.

Does AWS Athena support REST API?

You can now run queries on Amazon Athena via the REST API using the AWS SDKs for Java, .NET, Node.js, PHP, Python, Ruby, Go, and C++.


1 Answer

Athena is indeed not a low-latency data store. You will very rarely see response times below one second, and they will often be considerably longer. In the general case Athena is not suitable as a backend for an API, but of course that depends on what kind of API it is. If it's some kind of analytics service, perhaps users don't expect sub-second response times? I have built APIs that use Athena and work really well, but those were services where response times in seconds were expected (and even considered fast), and I got help from the Athena team to tune our account to our workload.

To understand why Athena is "slow", we can dissect what happens when you submit a query to Athena:

  1. Your code starts a query by using the StartQueryExecution API call
  2. The Athena service receives the query, and puts it on a queue. If you're unlucky your query will sit in the queue for a while
  3. When there is available capacity the Athena service takes your query from the queue and makes a query plan
  4. The query plan requires loading table metadata from the Glue catalog, including the list of partitions, for all tables included in the query
  5. Athena also lists all the locations on S3 it got from the tables and partitions to produce a full list of files that will be processed
  6. The plan is then executed in parallel, and depending on its complexity, in multiple steps
  7. The results of the parallel executions are combined and a result is serialized as CSV and written to S3
  8. Meanwhile your code checks if the query has completed using the GetQueryExecution API call, until it gets a response that says that the execution has succeeded, failed, or been cancelled
  9. If the execution succeeded your code uses the GetQueryResults API call to retrieve the first page of results
  10. To respond to that API call, Athena reads the result CSV from S3, deserializes it, and serializes it as JSON for the API response
  11. If there are more than 1000 rows, the last two steps are repeated for each further page of results

A Presto expert could probably give more detail about steps 4-6, even though they are probably a bit modified in Athena's version of Presto. The details aren't very important for this discussion though.
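To make the client's side of that flow concrete, here is a minimal boto3 sketch of steps 1, 8 and 9 (the database name, result bucket and 100 ms poll interval are placeholder assumptions, not values from the question):

    import time

    import boto3  # AWS SDK for Python

    athena = boto3.client("athena")

    def run_query(sql, database, output_location, poll_interval=0.1):
        # Step 1: submit the query
        execution_id = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )["QueryExecutionId"]
        # Step 8: poll until the execution reaches a terminal state
        while True:
            status = athena.get_query_execution(QueryExecutionId=execution_id)
            state = status["QueryExecution"]["Status"]["State"]
            if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                return execution_id, state
            time.sleep(poll_interval)  # every sleep here adds to the total latency

    # placeholder database and result location
    execution_id, state = run_query("SELECT NOW()", "my_database",
                                    "s3://my-athena-results/")
    if state == "SUCCEEDED":
        # Step 9: fetch the first page of results (at most 1000 rows per page)
        page = athena.get_query_results(QueryExecutionId=execution_id,
                                        MaxResults=1000)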

If you run a query over a lot of data, tens of gigabytes or more, the total execution time will be dominated by step 6. If the result is also big, step 7 will be a factor.

If your data set is small, and/or involves thousands of files on S3, then steps 4-5 will instead dominate.

Here are some reasons why Athena queries can never be fast, even if they don't touch S3 (for example SELECT NOW()):

  • There will be at least three API calls before you get the response: a StartQueryExecution, a GetQueryExecution, and a GetQueryResults. Their round-trip times (RTT) alone can add up to more than 100 ms.
  • You will most likely have to call GetQueryExecution multiple times, and the delay between calls puts a lower bound on how quickly you can discover that the query has succeeded: if you poll every 100 ms you will on average add half of 100 ms + RTT to the total time, because on average you'll miss the actual completion by that much.
  • Athena writes the results to S3 before it marks the execution as succeeded, and since it produces a single CSV file this is not done in parallel. A big response takes time to write.
  • GetQueryResults must read the CSV from S3, parse it, and serialize it as JSON. Subsequent pages must skip ahead in the CSV, and may be even slower.
  • Athena is a multi-tenant service, all customers are competing for resources, and your queries will get queued when there aren't enough resources available.
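As a rough back-of-the-envelope budget (assuming a hypothetical 30 ms RTT per API call and 100 ms polling): three API calls cost about 3 × 30 = 90 ms, the average polling miss adds about (100 + 30) / 2 = 65 ms, and writing and then reading the result CSV on S3 adds tens of milliseconds more – so you're already around 200 ms of pure overhead before any queueing or actual query execution.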

If you want to know what affects the performance of your queries you can use the ListQueryExecutions API call to list recent query execution IDs (I think you can go back 90 days at the most), and then use GetQueryExecution to get query statistics (see the documentation for QueryExecution.Statistics for what each property means). With this information you can figure out if your slow queries are because of queueing, execution, or the overhead of making the API calls (if it's not the first two, it's likely the last).
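For example, a small boto3 sketch along these lines can split recent executions into queue time versus engine time (the field names are from the QueryExecution.Statistics documentation; the page size of 50 is just an assumption):

    import boto3

    athena = boto3.client("athena")

    # most recent executions first, up to 50 per page
    ids = athena.list_query_executions(MaxResults=50)["QueryExecutionIds"]

    for execution_id in ids:
        stats = athena.get_query_execution(QueryExecutionId=execution_id)[
            "QueryExecution"]["Statistics"]
        queued = stats.get("QueryQueueTimeInMillis", 0)
        engine = stats.get("EngineExecutionTimeInMillis", 0)
        total = stats.get("TotalExecutionTimeInMillis", 0)
        # time not explained by queueing or the engine is planning and
        # service-side processing; your own API round trips come on top
        print(f"{execution_id}: queue={queued}ms engine={engine}ms total={total}ms")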

There are some things you can do to cut some of the delays, but these tips are unlikely to get you down to sub second latencies:

  • If you query a lot of data, use file formats that are optimized for that kind of thing; Parquet is almost always the answer. Also make sure your file sizes are optimal, around 100 MB.
  • Avoid lots of files, and avoid deep hierarchies. Ideally have just one or a few files per partition, and don't organize files in "subdirectories" (S3 prefixes with slashes) except for those corresponding to partitions.
  • Avoid running queries at the top of the hour: this is when everyone else's scheduled jobs run, and there is significant contention for resources during the first minutes of every hour.
  • Skip GetQueryResults and download the result CSV from S3 directly (see the sketch after this list). The GetQueryResults call is convenient if you want to know the data types of the columns, but if you already know them, or don't care, reading the data directly can save you some precious tens of milliseconds. If you need the column data types you can get the ….csv.metadata file that is written alongside the result CSV; it's undocumented Protobuf data, see here and here for more information.
  • Ask the Athena service team to tune your account. This might not be something you can get without higher tiers of support; I don't really know the politics of this, but start by talking to your account manager.
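As a sketch of the direct-download idea from the list above (this relies on the execution's OutputLocation being the full s3:// URI of the result CSV; error handling is omitted):

    import csv
    import io
    from urllib.parse import urlparse

    import boto3

    athena = boto3.client("athena")
    s3 = boto3.client("s3")

    def fetch_results_from_s3(execution_id):
        # the execution's OutputLocation is the s3:// URI of the result CSV
        location = athena.get_query_execution(QueryExecutionId=execution_id)[
            "QueryExecution"]["ResultConfiguration"]["OutputLocation"]
        parsed = urlparse(location)
        obj = s3.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
        text = obj["Body"].read().decode("utf-8")
        # the first row is the header; all values come back as strings
        return list(csv.reader(io.StringIO(text)))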
answered Oct 05 '22 by Theo