
OFFSET on AWS Athena

I would like to run a query on AWS Athena with both a LIMIT and an OFFSET clause. I take it the former is supported while the latter is not. Is there any way of emulating this functionality using other methods?

RoyalTS asked Jul 15 '17 03:07




1 Answer

Using OFFSET for pagination is very inefficient, especially for an analytic database like Presto that often has to perform a full table or partition scan. Additionally, the results will not necessarily be consistent between queries, so you can have duplicate or missing results when navigating between pages.

In an OLTP database like MySQL or PostgreSQL, it's better to use a range query over an index, where you keep track of the last value seen on the previous page.
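As a rough illustration of the range-query idea, the sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL or PostgreSQL. The `items` table, its columns, and the `fetch_page` helper are hypothetical names invented for this example. Each page is fetched with `WHERE id > last_seen` instead of an OFFSET.

```python
import sqlite3


def fetch_page(conn, last_id=None, page_size=3):
    """Return the next page of rows whose id is greater than last_id."""
    if last_id is None:
        # First page: no lower bound yet.
        cur = conn.execute(
            "SELECT id, name FROM items ORDER BY id LIMIT ?",
            (page_size,),
        )
    else:
        # Later pages: seek past the last id seen, using the primary key index.
        cur = conn.execute(
            "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        )
    return cur.fetchall()


# Hypothetical demo data: seven rows in an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 8)],
)

page1 = fetch_page(conn)
# Remember the last id on the previous page and continue from there.
page2 = fetch_page(conn, last_id=page1[-1][0])
page3 = fetch_page(conn, last_id=page2[-1][0])
```

The caller only has to remember the last key it saw, so the database can jump straight to the next page rather than scanning and discarding the skipped rows.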

In an OLAP database like Presto, it's better to cache the result set and perform pagination using the cached data. You don't want to run an expensive query over billions or trillions of rows each time the user clicks to go to a different page.
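A minimal sketch of the caching approach, in Python: the expensive query runs once, and every page request is served from the stored rows. `CachedResult` and `fake_expensive_query` are hypothetical names for this example, and the in-memory list stands in for whatever cache you actually use. With Athena, the result file the service writes to S3 might serve that role, though that is an assumption rather than something stated here.

```python
from typing import Callable, Iterable, List, Optional, Tuple


class CachedResult:
    """Runs a query once and serves pages from the cached rows."""

    def __init__(self, run_query: Callable[[], Iterable[Tuple]], page_size: int = 100):
        self._run_query = run_query
        self._rows: Optional[List[Tuple]] = None
        self.page_size = page_size
        self.query_count = 0  # how many times the underlying query actually ran

    def _load(self) -> List[Tuple]:
        # Execute the underlying query only on first access.
        if self._rows is None:
            self._rows = list(self._run_query())
            self.query_count += 1
        return self._rows

    def page(self, number: int) -> List[Tuple]:
        """Return page `number` (zero-based) from the cached rows."""
        rows = self._load()
        start = number * self.page_size
        return rows[start:start + self.page_size]


def fake_expensive_query() -> List[Tuple]:
    # Hypothetical stand-in for a long-running Presto/Athena query.
    return [(i,) for i in range(10)]


result = CachedResult(fake_expensive_query, page_size=4)
first = result.page(0)
third = result.page(2)
beyond = result.page(5)
```

Because all pages come from one snapshot, navigating back and forth cannot produce duplicate or missing rows, at the cost of holding (or storing) the full result set.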

See these articles for a longer explanation of the problem and the index approach:

  • http://use-the-index-luke.com/no-offset
  • http://use-the-index-luke.com/sql/partial-results/fetch-next-page

David Phillips answered Sep 30 '22 18:09