When Parquet data is written with partitioning on its date column, we get a directory structure like:
    /data
        _common_metadata
        _metadata
        _SUCCESS
        /date=1
            part-r-xxx.gzip
            part-r-xxx.gzip
        /date=2
            part-r-xxx.gzip
            part-r-xxx.gzip
If the partition date=2 is deleted without the involvement of Parquet utilities (via the shell or a file browser, etc.), do any of the metadata files need to be rolled back to the state when there was only the partition date=1?
Or is it OK to delete partitions at will and rewrite them (or not) later?
Data in a Parquet file is partitioned into blocks called row groups. Each row group in turn consists of one or more column chunks, each corresponding to a column in the dataset, and the data for each column chunk is then written out as a series of pages.
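If you want to see this layout for yourself, here is a minimal sketch (not from the answers above) that prints the row groups and column chunks of a single part file using the Parquet Java footer reader. It assumes parquet-hadoop is on the classpath and that the path points at one of your own part files; package names vary between older (parquet.hadoop) and newer (org.apache.parquet.hadoop) releases.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader
    import scala.collection.JavaConverters._

    // Read only the footer metadata; no row data is deserialized.
    val footer = ParquetFileReader.readFooter(
      new Configuration(), new Path("/data/date=1/part-r-xxx.gzip"))

    footer.getBlocks.asScala.zipWithIndex.foreach { case (rowGroup, i) =>
      println(s"row group $i: ${rowGroup.getRowCount} rows")
      // Each column chunk holds the pages for one column of this row group.
      rowGroup.getColumns.asScala.foreach { col =>
        println(s"  ${col.getPath}: ${col.getTotalSize} bytes")
      }
    }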
Background: Historically, Parquet files have been viewed as immutable, and for good reason: you incur significant cost structuring, compressing and writing out a Parquet file. It is better to append data via new Parquet files than to incur the cost of a complete rewrite.
As a reminder, Parquet output is usually partitioned: when we say "Parquet file", we are actually referring to multiple physical files, each belonging to a partition. This directory structure makes it easy to add new data every day, but it only works well when you do time-based analysis.
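As an illustration of why this layout suits time-based analysis (a sketch, not from the answer, assuming the same sqlContext and S3 path used in the example below), filtering on the partition column lets Spark prune whole directories and scan only the matching date= folders:

    val df = sqlContext.read.parquet("s3n://bucket/folderPath")
    // Only the /date=2 directory is scanned, thanks to partition pruning.
    df.filter(df("date") === 2).count()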
If you're using the DataFrame API, there is no need to roll back the metadata files.
For example:

1. Write your DataFrame to S3:

       df.write.partitionBy("date").parquet("s3n://bucket/folderPath")

2. Manually delete one of your partitions (the date=1 folder in S3) using an S3 browser (e.g. CloudBerry).

3. Now you can either load your data and see that the data is still valid, except for the data you had in partition date=1:

       sqlContext.read.parquet("s3n://bucket/folderPath").count

   or rewrite your DataFrame (or any other DataFrame with the same schema) using append mode:

       df2.write.mode("append").partitionBy("date").parquet("s3n://bucket/folderPath")
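As a quick sanity check after the append (a sketch, assuming the same bucket path as above), you can count rows per partition and confirm that the deleted partition is populated again:

    sqlContext.read.parquet("s3n://bucket/folderPath")
      .groupBy("date")
      .count()
      .show()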
You can also take a look at this question from the Databricks forum.