The plan was to get data from aws data exchange, move it to an s3 bucket then query it by aws athena for a data api. Everything works, just feels a bit slow.
No matter the dataset nor the query I can't get below 2 second in athena response time. Which is a lot for an API. I checked the best practices but seems that those are also above 2 sec.
So my question: Is 2 sec the minimal response time for athena?
If so then I have to switch to postgres.
Athena Performance Issues Unlike full database products, it does not have its own optimized storage layer. Therefore its performance is strongly dependent on how data is organized in S3—if data is sorted to allow efficient metadata based filtering, it will perform fast, and if not, some queries may be very slow.
Your queries might be temporarily queued before they run. Queries generally take a long time to run because of either a higher queuing time or a higher engine processing time.
Use Compression Compressing your data can speed up your queries significantly, as long as the files are either of an optimal size (see the next section), or the files are splittable. The smaller data sizes reduce the data scanned from Amazon S3, resulting in lower costs of running queries.
You can now run queries on Amazon Athena via the REST API using the AWS SDK available in Java, . NET, Node. js, PHP, Python, Ruby, Go, and C++.
Athena is indeed not a low latency data store. You will very rarely see response times below one second, and often they will be considerably longer. In the general case Athena is not suitable as a backend for an API, but of course that depends on what kind of an API it is. If it's some kind of analytics service, perhaps users don't expect sub second response times? I have built APIs that use Athena that work really well, but those were services where response times in seconds were expected (and even considered fast), and I got help from the Athena team to tune our account to our workload.
To understand why Athena is "slow", we can dissect what happens when you submit a query to Athena:
StartQueryExecution
API callGetQueryExecution
API call, until it gets a response that says that the execution has succeeded, failed, or been cancelledGetQueryResults
API call to retrieve the first page of resultsA Presto expert could probably give more detail about steps 4-6, even though they are probably a bit modified in Athena's version of Presto. The details aren't very important for this discussion though.
If you run a query over a lot of data, tens of gigabytes or more, the total execution time will be dominated by step 6. If the result is also big, 7 will be a factor.
If your data set is small, and/or involves thousands of files on S3, then 4-5 will instead dominate.
Here are some reasons why Athena queries can never be fast, even if they wouldn't touch S3 (for example SELECT NOW()
):
StartQueryExecution
, a GetQueryExecution
, and a GetQueryResults
, just their round trip time (RTT) would add up to more than 100ms.GetQueryExecution
multiple times, and the delay between calls will puts a bound on how quickly you can discover that the query has succeeded, e.g. if you call it every 100ms you will on average add half of 100ms + RTT to the total time because on average you'll miss the actual completion time by this much.GetQueryResults
must read the CSV from S3, parse it and serialize it as JSON. Subsequent pages must skip ahead in the CSV, and may be even slower.If you want to know what affects the performance of your queries you can use the ListQueryExecutions
API call to list recent query execution IDs (I think you can go back 90 days at the most), and then use GetQueryExecution
to get query statistics (see the documentation for QueryExecution.Statistics
for what each property means). With this information you can figure out if your slow queries are because of queueing, execution, or the overhead of making the API calls (if it's not the first two, it's likely the last).
There are some things you can do to cut some of the delays, but these tips are unlikely to get you down to sub second latencies:
GetQueryExecution
, download the CSV from S3 directly. The GetQueryExecution
call is convenient if you want to know the data types of the columns, but if you already know, or don't care, reading the data directly can save you some precious tens of milliseconds. If you need the column data types you can get the ….csv.metadata
file that is written alongside the result CSV, it's undocumented Protobuf data, see here and here for more information.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With