I have a daily ingestion of data into HDFS. From that data I generate Hive external tables partitioned by date. My question is: should I run MSCK REPAIR TABLE tablename after each data ingestion, in which case I have to run the command every day, or is running it once at table creation enough? Thanks a lot for your answers.
Best regards
The MSCK REPAIR TABLE command scans a file system such as Amazon S3 for Hive compatible partitions that were added to the file system after the table was created. MSCK REPAIR TABLE compares the partitions in the table metadata and the partitions in S3.
Similar to how fsck stands for filesystem consistency check, msck is Hive's metastore consistency check.
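MSCK REPAIR TABLE recovers all the partitions in the directory of a table and updates the Hive metastore. When creating a table using the PARTITIONED BY clause, partitions are generated and registered in the Hive metastore. Partition directories that are written straight to the file system afterwards are not registered automatically, which is the gap this command closes.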
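As a rough illustration of that behaviour, here is a minimal HiveQL sketch. The table name events, the column payload and the path /data/events are made up for the example and do not come from the question.

```sql
-- Hypothetical table and HDFS path, for illustration only.
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  payload STRING
)
PARTITIONED BY (dt STRING)
LOCATION '/data/events';

-- Directories such as /data/events/dt=2018-06-12 that were written
-- directly to HDFS are not visible to queries until they are
-- registered in the metastore.
MSCK REPAIR TABLE events;

-- List the partitions the metastore now knows about.
SHOW PARTITIONS events;
```

If SHOW PARTITIONS does not list a date directory that exists on HDFS, that partition has not been registered yet.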
You only need to run MSCK REPAIR TABLE when the structure or the partitions of the external table have changed. The command updates the table's metadata, so with a daily ingestion that adds a new date directory you need to run it after each load; running it once at table creation is not enough.
A common example:
You use a field dt, which represents a date, to partition the table.
On the first day you ingest data into dt=2018-06-12, then you should run MSCK REPAIR TABLE to update the metadata so that Hive is aware of the new partition dt=2018-06-12.
On the next day you ingest data into dt=2018-06-13, then you should run MSCK REPAIR TABLE again so that Hive is aware of the new partition dt=2018-06-13.
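For the daily case, the step after each load could look roughly like the sketch below. The table events and the path /data/events are hypothetical names, and the second option is an alternative that the answers above do not mention: ALTER TABLE ... ADD PARTITION registers just the one new directory instead of rescanning the whole table location.

```sql
-- Hypothetical table "events" stored under /data/events.

-- Option 1: rescan the table location after the day's load.
MSCK REPAIR TABLE events;

-- Option 2: register only the partition that was just ingested.
ALTER TABLE events ADD IF NOT EXISTS
  PARTITION (dt = '2018-06-13')
  LOCATION '/data/events/dt=2018-06-13';
```

Either statement can be scheduled right after the ingestion job. The second avoids listing every existing partition directory, which may matter once the table has accumulated many dates.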