
AWS Athena + S3 limitation

I am currently using AWS S3 + Athena for a project. After using them for one to two months, I have found some limitations. I am not sure whether I simply do not know how to use them properly or whether these are real limitations. Please do not ask why I chose them without enough research; there are two reasons:

  1. It is required by the project
  2. The documentation for Athena, S3 and AWS in general is not very centralized, and the functionality keeps changing. It is difficult to find out what Athena + S3 can or cannot do before actually using it for some time.

I got carried away; back to the topic. >_<

Currently, I am facing a problem. With increasing data volume, the amount of data scanned and the query time are growing dramatically (sometimes a query even fails with an exception such as "too many open files"). However, AWS S3 + Athena seems to support only partitions, not indexes. Hence the following questions.

Question 1:
Can I create something like an index on AWS S3 + Athena?

Question 2:
If I use partitioning, it seems that only one composite key (one or more columns used as labels in the S3 folder structure) can be specified; otherwise, the data size will double. Is this true?

Question 3:
Even if I am willing to increase the data size, it seems impossible for one table to have two composite partition keys. To achieve this, I would need two Athena tables and two copies of the same data in S3, each with a different partitioning scheme. Is this true?

Question 4:
For the error "too many open files": after some research, it seems to be an OS-level issue with a predefined limit on the number of file descriptors. In my situation, the same SQL usually runs without exceptions, but during certain periods it fails easily. My understanding is that Amazon runs a cluster of machines (for example, 32 nodes) that serves a certain number of customers, including my company and others. Each server has a limited number of file descriptors shared among all customers. At peak periods (when other companies are running heavy queries), the number of available file descriptors drops, which would explain why the same SQL over the same data sometimes fails and sometimes does not. Is this true?

Question 5:
Due to the lack of indexes, S3 + Athena is presumably not suited to complicated SQL queries. That means complicated join logic can only be done in a transformation layer before loading the data into S3. Is this true?

Question 6:
This question follows on from Question 5. Let me use a simple example to illustrate: a reporting system displays Orders and Trades. The relationship is that after an order is executed, a trade is generated. Order_ID is the key that links a trade to its related order activities. The partition key is the date.
Now, the following data comes:
[image in the original post showing the sample Order and Trade records]
The requirements are:
1. The report for Day 1 shows only the Order record O001 - Place Order
2. The report for Day 2 shows only the Order record O002 - Order Change
3. The report for Day 3 shows all records, including the 4 Order records and the 1 Trade record
4. The report for Day 4 shows only the Order record O004 - Remark Change
For Days 1, 2 and 4 this is easy, as I just display the data that arrives on the same day.
However, for Day 3 I need to display all the data, some from the past and some from the future (O001 - Remark Change).
To avoid complicated SQL, I can only do the join logic in the transformation layer.
However, when performing the transformation on Day 3, if the counterparty did not send me the data on Days 1 and 2, I can only search the historical files, which is not good because I never know how many days I need to search back.
Even if I do the search in Athena, since Order_ID is not a partition key, a full table scan is needed.
That is still not the worst part. The worst case is that at transformation time on Day 3, O001 - Remark Change belongs to Day 4; it is future data and should not be known on Day 3.
Is there a better way to do this? Or is AWS S3 + Athena simply not suitable for such a complicated case? (The above is a simplified version of my actual situation.)

I know I have asked a lot of questions, but these are all things I really want to clarify, and I cannot find clear answers anywhere. Any help is highly appreciated; thanks so much.

asked Jun 25 '18 by kzfid


2 Answers

Indexes

No, Amazon Athena (and Presto, on which it is based) does not support indexes. Athena/Presto (and even Redshift) are designed for Big Data, and an index on Big Data is itself Big Data, so maintaining such a huge index would not be efficient.

While traditional databases get faster with indexes, this does not apply to Big Data systems. Instead, use partitioning, compression and columnar data formats to improve performance.

Partitions

Partitions are hierarchical (eg Year -> Month -> Day). The objective with partitions is to "skip over" files that don't need to be read. Therefore, they will only provide a benefit if the WHERE clause uses the partition hierarchy.

For example, using SELECT ... WHERE year=2018 will use the partition and skip all other years.

If your query does not put one of these partitioned fields in the WHERE clause, then all directories and files need to be scanned, so there is no benefit gained.
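The pruning behaviour described above can be sketched in a few lines. This is a conceptual illustration only, not real Athena/Presto code; the bucket name, table path and partition values are made up.

```python
# Hypothetical S3 layout for a table partitioned by year and month.
s3_keys = [
    "s3://my-bucket/orders/year=2017/month=12/data1.parquet",
    "s3://my-bucket/orders/year=2018/month=01/data2.parquet",
    "s3://my-bucket/orders/year=2018/month=02/data3.parquet",
]

def prune(keys, **partition_filter):
    """Keep only the keys whose path contains every partition=value pair."""
    wanted = [f"{col}={val}" for col, val in partition_filter.items()]
    return [k for k in keys if all(part in k.split("/") for part in wanted)]

# A query with WHERE year=2018 reads only the two 2018 files;
# the 2017 file is never opened at all.
print(prune(s3_keys, year="2018"))
```

The point of the sketch is that the engine decides which files to read purely from the path components, before touching any file contents, which is why a WHERE clause on a non-partition column cannot avoid a full scan.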

You say "the data size will be doubled", but this is not the case. All data is stored only once; partitioning does not change the data size.

too many open files

If this is an error generated by Amazon Athena, then you should raise it with AWS Support.

Complex queries

Athena can certainly do complex queries, but it would not be the ideal platform. If you are regularly doing complex SQL queries, consider using Amazon Redshift.

Question 6: Table/Query

I am unable to follow the flow of your requirements. If you are seeking help for SQL, please create a separate question showing a sample of the data in the table and a sample of the output you are seeking.

answered Oct 13 '22 by John Rotenstein

Extending John's well-put answer with a couple of points:

Q1: One way to gain performance is to sort the data at regular intervals. With columnar formats such as ORC and Parquet, this gives performance benefits because most stripes/row groups within files can be skipped.
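The reason sorting helps can be illustrated with a toy model (this is not real ORC/Parquet code): columnar writers record min/max statistics per row group, and a reader skips any group whose range cannot contain the predicate value.

```python
# Toy model: a sorted Order_ID column split into row groups of 500 rows,
# with (min, max) statistics recorded per group, as a columnar writer would.
sorted_order_ids = [f"O{n:04d}" for n in range(1, 2001)]
row_group_size = 500

groups = [sorted_order_ids[i:i + row_group_size]
          for i in range(0, len(sorted_order_ids), row_group_size)]
stats = [(g[0], g[-1]) for g in groups]  # min/max of each row group

def groups_to_read(target):
    """Indices of row groups whose min/max range could contain target."""
    return [i for i, (lo, hi) in enumerate(stats) if lo <= target <= hi]

# Because the column is sorted, a point lookup touches a single row group;
# on unsorted data the target could fall in any group's range.
print(groups_to_read("O1234"))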

Q3: Partition by one set of columns and sort the data by another set, and you get performance benefits for queries on either set of selectors.

Q4: Too many open files: this is a Presto configuration setting (Athena is based on Presto). The larger that value, the more memory is required to buffer the contents being written until a stripe is completed, hence it is limited. Athena does not provide complete isolation, so you, or any other user running queries that produce many small files, might be triggering this. Contact support, though I doubt they can help; one of the disadvantages of shared infrastructure. :)

Q5: If by complicated you mean huge queries doing a lot of operations, Athena should in theory be able to handle that. The amount of data you can sort in one query, however, is limited by the number of nodes in the cluster Amazon brings up.

Since Athena has a scalability limitation due to the Amazon-decided cluster size, it is recommended to perform ETL first and then use Athena for querying. With newer versions of Presto, though, this constraint is becoming less and less of a practical issue.

Q6: Not my area of expertise; this is better asked as an SQL question, as Presto follows the ANSI SQL standard. You can check the list of functions and operators here: https://prestodb.io/docs/current/functions.html. If Athena's scalability is a hindrance for you, you can look at Qubole, EMR or Starburst for managed Presto distributions.
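The "ETL before Athena" pattern suggested above can be sketched roughly as follows for the questioner's Order/Trade example: resolve the Order_ID join in the transformation step, then write one denormalized, date-partitioned dataset so Athena never joins at read time. All field names and values here are made up for illustration.

```python
# Hypothetical input records from the Order/Trade example in the question.
orders = [
    {"order_id": "O001", "event": "Place Order", "date": "day1"},
    {"order_id": "O001", "event": "Remark Change", "date": "day4"},
]
trades = [
    {"order_id": "O001", "trade_id": "T001", "date": "day3"},
]

def denormalize(orders, trades):
    """Attach each order's trade (if any), so the query layer needs no join."""
    trade_by_order = {t["order_id"]: t["trade_id"] for t in trades}
    return [{**o, "trade_id": trade_by_order.get(o["order_id"])} for o in orders]

# The ETL job would write these rows out partitioned by date for Athena.
for row in denormalize(orders, trades):
    print(row)
```

Note this sketch does not solve the questioner's late-arrival or future-data problem; it only shows where the join moves to.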

answered Oct 13 '22 by Ankit Dixit