
Partition data for AWS Athena results in a lot of small files in S3

I have a large dataset (>40 GB) which I want to store in S3 and then query with Athena.

As suggested by this blog post, I could store my data in the following hierarchical directory structure, so that MSCK REPAIR TABLE can automatically add the partitions once the table has been created from my dataset.

s3://yourBucket/pathToTable/<PARTITION_COLUMN_NAME>=<VALUE>/<PARTITION_COLUMN_NAME>=<VALUE>/

However, this requires me to split my dataset into many smaller data files, each stored under a nested folder determined by its partition keys.
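As a minimal sketch of the layout described above, the following Python builds a Hive-style partition prefix from a set of partition keys. The bucket and table path are the placeholders from the pattern above, and the `year`/`month`/`day` columns are hypothetical; the real keys depend on the dataset.

```python
from datetime import date


def partition_prefix(bucket: str, table_path: str, partitions: dict) -> str:
    """Build a Hive-style prefix: s3://bucket/table/col=value/col=value/"""
    # dicts preserve insertion order, so keys nest in the order given
    parts = "/".join(f"{col}={val}" for col, val in partitions.items())
    return f"s3://{bucket}/{table_path.strip('/')}/{parts}/"


# Hypothetical partition columns derived from a record's date
d = date(2018, 2, 9)
prefix = partition_prefix(
    "yourBucket",
    "pathToTable",
    {"year": d.year, "month": f"{d.month:02d}", "day": f"{d.day:02d}"},
)
print(prefix)  # s3://yourBucket/pathToTable/year=2018/month=02/day=09/
```

Every distinct combination of key values becomes its own folder, which is why adding finer-grained keys multiplies the number of files.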

Although partitioning could reduce the amount of data scanned by Athena and therefore speed up queries, would managing a large number of small files cause performance issues for S3? Is there a tradeoff here that I need to consider?

panc asked Feb 09 '18 15:02



2 Answers

Yes, you may see a significant drop in efficiency with small files and lots of partitions.

There is a good explanation here, with suggestions on file sizes and the number of partitions: files should be larger than 128 MB to compensate for the overhead.

I also ran some experiments on a very small dataset (1 GB), partitioning my data by minute, hour and day. The amount of data scanned decreases as you make the partitions smaller, but the query time increases a lot (40 times slower in some experiments).
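To make the 128 MB guideline concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the data is spread evenly across partitions with one file per partition, uses 40 GiB as a stand-in for the ">40G" in the question, and assumes one year of data for the daily and hourly cases. All of these are illustrative assumptions, not measurements.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TARGET_FILE_BYTES = 128 * MIB  # guideline quoted above, treated as a rough floor


def avg_partition_bytes(dataset_bytes: int, num_partitions: int) -> float:
    """Average bytes per partition, assuming an even spread."""
    return dataset_bytes / num_partitions


def max_partitions(dataset_bytes: int, target: int = TARGET_FILE_BYTES) -> int:
    """Largest partition count that still averages at least `target` bytes each."""
    return max(1, dataset_bytes // target)


dataset = 40 * GIB  # hypothetical size; the question only says ">40G"
print(max_partitions(dataset))  # 320

# Hypothetical partitioning schemes over one year of data
for label, n in [("daily", 365), ("hourly", 365 * 24)]:
    avg = avg_partition_bytes(dataset, n)
    ok = avg >= TARGET_FILE_BYTES
    print(f"{label}: {n} partitions, ~{avg / MIB:.1f} MiB each, meets target: {ok}")
```

Under these assumptions, a dataset of this size supports only a few hundred partitions before the average file falls below the guideline, so fine-grained keys such as hour or minute would produce files far smaller than recommended.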

Gonzalo Solera answered Oct 14 '22 08:10



I will try to get into it without veering too much into the realm of opinion.

In the use cases for which I have used Athena, 40 GB is actually a very small dataset by the standards of what the underlying technology (Presto) is designed to handle. According to the Presto web page, Facebook uses it to query their 300 PB data warehouse. I routinely use it on datasets between 500 GB and 1 TB in size.

As for the underlying storage, S3 was used to host Dropbox and Netflix, so I doubt most enterprises could come anywhere near taxing the infrastructure. The performance issues you may have heard about with S3 relate to websites that store many small pieces of static content in files scattered across S3. In that case, a delay in retrieving one of these small pieces of content might affect the user experience on the larger site.

Related Reading: Presto

Zerodf answered Oct 14 '22 07:10
