Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

AWS Athena MSCK REPAIR TABLE takes too long for a small dataset

I am having issues with amazon athena, I have a small bucket ( 36430 objects , 9.7 mb ) with 4 levels of partition ( my-bucket/p1=ab/p2=cd/p3=ef/p4=gh/file.csv ) but when I run the command

MSCK REPAIR TABLE db.table

is taking over 25 minutes, and I have plans to put data of the magnitude of TB on Athena and I won't do it if this issue remains

Does anybody know why is taking too long?

Thanks in advance

like image 643
JorgeGarza Avatar asked Jul 14 '17 16:07

JorgeGarza


2 Answers

While the marked answer is technically correct, it doesn't address your real issue, which is that you have too many files.

I have a small bucket ( 36430 objects , 9.7 mb ) with 4 levels of partition ( my-bucket/p1=ab/p2=cd/p3=ef/p4=gh/file.csv )

For such a small table, 36430 files creates a huge amount of overhead on S3, and the partitioning with 4 levels is super-overkill. The partitioning has hindered query performance rather than optimizing it. MSCK is slow because it is waiting for S3 listing among other things.

Athena would read the entire 9.7MB table if it were in one file faster than it would be able to list that huge directory structure.

I recommend removing the partitions completely, or if you really must have them then remove p2, p3 and p4 levels. Also consider processing it into another table to compact the files into larger ones.

Some suggest optimal file sizes are between 64MB and 4GB, which relates to the native block sizes on S3. It's also helpful to have a number of files that is some multiple of the workers in the cluster, although that is unknown with Athena. Your data is smaller than that range, so 1 or perhaps 8 files at most would be appropriate.

Some references: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/#OptimizeFileSizes

https://www.upsolver.com/blog/small-file-problem-hdfs-s3

like image 104
Davos Avatar answered Oct 05 '22 03:10

Davos


MSCK REPAIR TABLE can be a costly operation, because it needs to scan the table's sub-tree in the file system (the S3 bucket). Multiple levels of partitioning can make it more costly, as it needs to traverse additional sub-directories. Assuming all potential combinations of partition values occur in the data set, this can turn into a combinatorial explosion.

If you are adding new partitions to an existing table, then you may find that it's more efficient to run ALTER TABLE ADD PARTITION commands for the individual new partitions. This avoids the need to scan the table's entire sub-tree in the file system. It is less convenient than simply running MSCK REPAIR TABLE, but sometimes the optimization is worth it. A viable strategy is often to use MSCK REPAIR TABLE for an initial import, and then use ALTER TABLE ADD PARTITION for ongoing maintenance as new data gets added into the table.

If it's really not feasible to use ALTER TABLE ADD PARTITION to manage the partitions directly, then the execution time might be unavoidable. Reducing the number of partitions might reduce execution time, because it won't need to traverse as many directories in the file system. Of course, then the partitioning is different, which might impact query execution time, so it's a trade-off.

like image 25
Chris Nauroth Avatar answered Oct 05 '22 04:10

Chris Nauroth