I know that MSCK REPAIR TABLE
updates the metastore with the current partitions of an external table.
To do that, you only need to do ls
on the root folder of the table (given the table is partitioned by only one column), and get all its partitions, clearly a < 1s operation.
But in practice, the operation can take a very long time to execute (or even timeout if ran on AWS Athena).
So my question is, what does MSCK REPAIR TABLE
actually do behind the scenes and why?
How does MSCK REPAIR TABLE find the partitions?
Additional data in case it's relevant:
Our data is all on S3, it's both slow when running on EMR (Hive) or Athena (Presto), there are ~450 partitions in the table, every partition has on avg 90 files, overall 3 Gigabytes for a partition, files are in Apache parquet format
MSCK is slow because it is waiting for S3 listing among other things. Athena would read the entire 9.7MB table if it were in one file faster than it would be able to list that huge directory structure. I recommend removing the partitions completely, or if you really must have them then remove p2, p3 and p4 levels.
The MSCK REPAIR TABLE command scans a file system such as Amazon S3 for Hive compatible partitions that were added to the file system after the table was created. MSCK REPAIR TABLE compares the partitions in the table metadata and the partitions in S3.
ALTER TABLE RECOVER PARTITIONS is the command that is widely used in Hive to refresh partitions as new partitions are directly added to the file system by other users. Qubole has added path validation to the ALTER TABLE RECOVER PARTITIONS command. The command only recovers valid paths.
By partitioning your data, you can restrict the amount of data scanned by each query, thus improving performance and reducing cost.
You are right in the sense it reads the directory structure, creates partitions out of it and then updates the hive metastore. In fact more recently, the command was improved to remove non-existing partitions from metastore as well. The example that you are giving is very simple since it has only one level of partition keys. Consider table with multiple partition keys (2-3 partition keys is common in practice). msck repair
will have to do a full-tree traversal of all the sub-directories under the table directory, parse the file names, make sure that the file names are valid, check if the partition is already existing in the metastore and then add the only partitions which are not present in the metastore. Note that each listing on the filesystem is a RPC to the namenode (in case of HDFS) or a web-service call in case of S3 or ADLS which can add to significant amount of time. Additionally, in order to figure out if the partition is already present in metastore or not, it needs to do a full listing of all the partitions which metastore knows of for the table. Both these steps can potentially increase the time taken for the command on large tables. The performance of msck repair table was improved considerably recently Hive 2.3.0 (see HIVE-15879 for more details). You may want to tune hive.metastore.fshandler.threads
and hive.metastore.batch.retrieve.max
to improve the performance of command.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With