When Parquet data is written with partitioning on its date column, we get a directory structure like:
    /data
        _common_metadata
        _metadata
        _SUCCESS
        /date=1
            part-r-xxx.gzip
            part-r-xxx.gzip
        /date=2
            part-r-xxx.gzip
            part-r-xxx.gzip
If the partition date=2 is deleted without the involvement of Parquet utilities (via the shell or a file browser, etc.), do any of the metadata files need to be rolled back to the state when there was only the partition date=1?
Or is it OK to delete partitions at will and rewrite them (or not) later?
Data in a Parquet file is partitioned into blocks called row groups. Each row group in turn consists of one or more column chunks, each corresponding to a column in the dataset, and the data for each column chunk is then written out as a series of pages.
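If you want to see this layout for yourself, here is a minimal sketch (not from the answers above) that prints the row groups and column chunks of a single part file using the Parquet Java footer reader. It assumes parquet-hadoop is on the classpath and that the path points at one of your own part files; package names vary between older (parquet.hadoop) and newer (org.apache.parquet.hadoop) releases.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader
    import scala.collection.JavaConverters._

    // Read only the footer metadata; no row data is deserialized.
    val footer = ParquetFileReader.readFooter(
      new Configuration(), new Path("/data/date=1/part-r-xxx.gzip"))

    footer.getBlocks.asScala.zipWithIndex.foreach { case (rowGroup, i) =>
      println(s"row group $i: ${rowGroup.getRowCount} rows")
      // Each column chunk holds the pages for one column of this row group.
      rowGroup.getColumns.asScala.foreach { col =>
        println(s"  ${col.getPath}: ${col.getTotalSize} bytes")
      }
    }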
Background: Historically, Parquet files have been viewed as immutable, and for good reason: you incur significant cost structuring, compressing and writing out a Parquet file. It is better to append data via new Parquet files than to incur the cost of a complete rewrite.
As a reminder, Parquet output is usually partitioned: when we say "Parquet file", we are actually referring to multiple physical files, each belonging to a partition. This directory structure makes it easy to add new data every day, but it only works well when you do time-based analysis.
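As an illustration of why this layout suits time-based analysis (a sketch, not from the answer, assuming the same sqlContext and S3 path used in the example below), filtering on the partition column lets Spark prune whole directories and scan only the matching date= folders:

    val df = sqlContext.read.parquet("s3n://bucket/folderPath")
    // Only the /date=2 directory is scanned, thanks to partition pruning.
    df.filter(df("date") === 2).count()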
If you're using the DataFrame API, there is no need to roll back the metadata files.
For example:

1. Write your DataFrame to S3:

       df.write.partitionBy("date").parquet("s3n://bucket/folderPath")

2. Manually delete one of your partitions (the date=1 folder in S3) using an S3 browser (e.g. CloudBerry).

3. Now you can either load your data and see that the data is still valid, except for the data you had in partition date=1:

       sqlContext.read.parquet("s3n://bucket/folderPath").count

   or rewrite your DataFrame (or any other DataFrame with the same schema) using append mode:

       df2.write.mode("append").partitionBy("date").parquet("s3n://bucket/folderPath")
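As a quick sanity check after the append (a sketch, assuming the same bucket path as above), you can count rows per partition and confirm that the deleted partition is populated again:

    sqlContext.read.parquet("s3n://bucket/folderPath")
      .groupBy("date")
      .count()
      .show()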
You can also take a look at this question from the Databricks forum.