I want to overwrite specific partitions instead of all in spark. I am trying the following command:
df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4')
where df is dataframe having the incremental data to be overwritten.
hdfs-base-path contains the master data.
When I try the above command, it deletes all the partitions, and inserts those present in df at the hdfs path.
What my requirement is to overwrite only those partitions present in df at the specified hdfs path. Can someone please help me in this?
A write that dynamically overwrites partitions removes all existing data in each logical partition for which the write will commit new data. Any existing logical partition for which the write does not contain data will remain unchanged.
If you want to increase the partitions of your DataFrame, all you need to run is the repartition() function. Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned.
Finally! This is now a feature in Spark 2.3.0: SPARK-20236
To use it, you need to set the spark.sql.sources.partitionOverwriteMode
setting to dynamic, the dataset needs to be partitioned, and the write mode overwrite
. Example:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") data.write.mode("overwrite").insertInto("partitioned_table")
I recommend doing a repartition based on your partition column before writing, so you won't end up with 400 files per folder.
Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With