Do Parquet Metadata Files Need to be Rolled-back?

When Parquet data is written with partitioning on its date column, we get a directory structure like:

/data
    _common_metadata
    _metadata
    _SUCCESS
    /date=1
        part-r-xxx.gzip
        part-r-xxx.gzip
    /date=2
        part-r-xxx.gzip
        part-r-xxx.gzip
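
A minimal sketch of the kind of write that produces this layout (assuming a Spark 1.x shell with sqlContext available; the column names and values are illustrative only):

    import sqlContext.implicits._

    // A tiny illustrative DataFrame with a "date" partition column
    val df = Seq((1, "a"), (1, "b"), (2, "c"), (2, "d")).toDF("date", "value")

    // partitionBy("date") creates the /data/date=1 and /data/date=2 directories;
    // whether the _metadata and _common_metadata summary files appear depends on
    // the Spark/Parquet version and configuration
    df.write.partitionBy("date").parquet("/data")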

If the partition date=2 is deleted without the involvement of Parquet utilities (via the shell or a file browser, etc.), do any of the metadata files need to be rolled back to when there was only the partition date=1?

Or is it ok to delete partitions at will and rewrite them (or not) later?

asked Oct 04 '15 by BAR

People also ask

How is data stored in Parquet format?

Each block in the Parquet file is stored in the form of row groups. So, data in a Parquet file is partitioned into multiple row groups. These row groups in turn consist of one or more column chunks, each corresponding to a column in the dataset. The data for each column chunk is then written in the form of pages.
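
A rough sketch of inspecting that structure with the parquet-hadoop API (the file path is illustrative, and ParquetFileReader.readFooter is deprecated in newer Parquet releases):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader
    import scala.collection.JavaConverters._

    // The footer lists one BlockMetaData entry per row group
    val footer = ParquetFileReader.readFooter(
      new Configuration(), new Path("/data/date=1/part-r-xxx.gzip"))

    footer.getBlocks.asScala.foreach { rowGroup =>
      println(s"row group with ${rowGroup.getRowCount} rows")
      // Each row group holds one column chunk per column, stored as pages
      rowGroup.getColumns.asScala.foreach { chunk =>
        println(s"  column chunk ${chunk.getPath} (${chunk.getTotalSize} bytes)")
      }
    }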

Are Parquet files immutable?

Background. Historically, Parquet files have been viewed as immutable, and for good reason. You incur significant costs for structuring, compressing, and writing out a Parquet file. It is better to append data via new Parquet files rather than incur the cost of a complete rewrite.

Why do we partition Parquet files?

As a reminder, Parquet files are partitioned. When we say "Parquet file", we are actually referring to multiple physical files, each of them being a partition. This directory structure makes it easy to add new data every day, but it only works well when you perform time-based analyses.
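
A small sketch of how that pays off when reading (paths and values are illustrative): filtering on the partition column lets Spark skip the other date= directories entirely (partition pruning).

    import sqlContext.implicits._

    // Only the files under /data/date=2 are scanned, thanks to partition pruning
    val recent = sqlContext.read.parquet("/data").filter($"date" === 2)
    recent.count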


1 Answer

If you're using DataFrames, there is no need to roll back the metadata files.

For example:

You can write your DataFrame to S3

df.write.partitionBy("date").parquet("s3n://bucket/folderPath")

Then manually delete one of your partitions (the date=1 folder in S3) using an S3 browser (e.g. CloudBerry).
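
The same deletion could also be done programmatically; a minimal sketch using the Hadoop FileSystem API (the sc SparkContext and the bucket path are assumptions, and any S3 tool works just as well):

    import org.apache.hadoop.fs.Path

    val partition = new Path("s3n://bucket/folderPath/date=1")
    val fs = partition.getFileSystem(sc.hadoopConfiguration)

    // Recursively delete the whole date=1 directory, without touching any
    // of the Parquet metadata files
    fs.delete(partition, true)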

Now you can

  • Load your data and see that the data is still valid, except for the data you had in partition date=1:

    sqlContext.read.parquet("s3n://bucket/folderPath").count

  • Or rewrite your DataFrame (or any other DataFrame with the same schema) using append mode:

    df2.write.mode("append").partitionBy("date").parquet("s3n://bucket/folderPath")
    

You can also take a look at this question from the Databricks forum.

answered Oct 26 '22 by Nadav