Overwrite specific partitions in Spark DataFrame write method

I want to overwrite only specific partitions in Spark, not all of them. I am trying the following command:

df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')

where df is a DataFrame holding the incremental data to be overwritten.

hdfs-base-path contains the master data.

When I run the above command, it deletes all existing partitions and writes only those present in df to the HDFS path.

My requirement is to overwrite only those partitions that are present in df at the specified HDFS path. Can someone please help me with this?

227

asked Jul 20 '16 18:07

yatin

1 Answer

Finally! This is now a feature in Spark 2.3.0: SPARK-20236

To use it, set spark.sql.sources.partitionOverwriteMode to dynamic; the dataset must be partitioned and the write mode must be overwrite. Example:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
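
For the path-based ORC write from the question, the same setting could be applied in PySpark roughly as below. This is a minimal sketch, assuming Spark 2.3.0 or later; the helper name overwrite_present_partitions and its parameters are hypothetical, and spark and df stand for an existing SparkSession and the incremental DataFrame.

```python
def overwrite_present_partitions(spark, df, path, partition_col):
    """Overwrite only the partitions under `path` that also occur in `df`.

    Hypothetical helper: `spark` is an existing SparkSession and `df` the
    incremental DataFrame; neither is created here.
    """
    # Dynamic mode: Spark replaces just the partitions it writes to,
    # instead of clearing everything under the base path first.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    # Same ORC write as in the question, expressed with the builder API.
    df.write.mode("overwrite").partitionBy(partition_col).orc(path)


# Usage, with the path and column taken from the question:
# overwrite_present_partitions(spark, df, "maprfs:///hdfs-base-path", "col4")
```

Partitions that exist under the base path but have no rows in df should be left untouched in this mode.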

I recommend repartitioning by your partition column before writing; otherwise every write task can emit its own file into each partition folder, and you can end up with something like 400 files per folder.
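
The repartition step might look like the following in PySpark. It is only a sketch: write_repartitioned is a hypothetical helper name, df is assumed to be the DataFrame being written, and the ORC target mirrors the question rather than the table used in the example above.

```python
def write_repartitioned(df, path, partition_col):
    """Cluster rows by the partition column, then write partitioned ORC.

    Hypothetical helper; `df` is an existing DataFrame.
    """
    # Hash-repartition on the partition column so that all rows sharing a
    # value end up in the same task, which keeps the file count per folder low.
    clustered = df.repartition(partition_col)
    clustered.write.mode("overwrite").partitionBy(partition_col).orc(path)
    return clustered


# Usage:
# write_repartitioned(df, "maprfs:///hdfs-base-path", "col4")
```

The trade-off is an extra shuffle, and a partition value with far more rows than the others will be written by a single task.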

Before Spark 2.3.0, the best solution was to run SQL statements that delete those partitions and then write them with mode append.
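
One way the delete-then-append approach could be sketched in PySpark is shown below. Everything here is an assumption for illustration: the helper names are hypothetical, the target is assumed to be a Hive-style partitioned table with a single string-typed partition column, and the partition values are assumed to contain no quote characters.

```python
def drop_partition_statements(table, partition_col, values):
    """Build one ALTER TABLE ... DROP PARTITION statement per value.

    Assumes a string-typed partition column and values without quotes.
    """
    return [
        "ALTER TABLE {t} DROP IF EXISTS PARTITION ({c}='{v}')".format(
            t=table, c=partition_col, v=value
        )
        for value in values
    ]


def replace_partitions(spark, df, table, partition_col):
    """Delete the partitions present in `df`, then append `df` to the table."""
    # Collect the distinct partition values carried by the incremental data.
    rows = df.select(partition_col).distinct().collect()
    values = [row[0] for row in rows]
    # Drop each affected partition through SQL.
    for statement in drop_partition_statements(table, partition_col, values):
        spark.sql(statement)
    # Append the new data; insertInto matches columns by position.
    df.write.mode("append").insertInto(table)


# Usage, with the table name from the example above:
# replace_partitions(spark, df, "partitioned_table", "col4")
```

The drop and the append are separate steps, so a failure between them leaves the affected partitions empty until the write is retried.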

125

answered Oct 13 '22 05:10

Madhava Carrillo


Tags: apache-spark, apache-spark-sql, spark-dataframe
