How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, when recomputing last week's daily job, we want to overwrite only last week's data. The default Spark behaviour is to overwrite the whole table, even if only some partitions are going to be written.
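To make the difference concrete, here is a toy model in plain Python, not Spark itself. It treats a partitioned table as a dict from partition value to rows. The function name `overwrite`, the `day` column, and the sample rows are all made up for illustration; the mode names mirror the `static` (default) and `dynamic` values of `spark.sql.sources.partitionOverwriteMode`.

```python
# Toy model only: a "table" is a dict mapping partition value -> list of rows.
def overwrite(table, new_rows, partition_key, mode="static"):
    """Return the table contents after an overwrite in the given mode."""
    incoming = {}
    for row in new_rows:
        incoming.setdefault(row[partition_key], []).append(row)
    if mode == "static":
        # Whole table is replaced: partitions absent from new_rows are lost.
        return incoming
    if mode == "dynamic":
        # Only partitions present in new_rows are replaced.
        return {**table, **incoming}
    raise ValueError(f"unknown mode: {mode}")


table = {
    "2018-01-01": [{"day": "2018-01-01", "value": 1}],
    "2018-01-02": [{"day": "2018-01-02", "value": 2}],
}
recomputed = [{"day": "2018-01-02", "value": 20}]

static_result = overwrite(table, recomputed, "day", mode="static")
dynamic_result = overwrite(table, recomputed, "day", mode="dynamic")

print(sorted(static_result))   # ['2018-01-02']
print(sorted(dynamic_result))  # ['2018-01-01', '2018-01-02']
```

The question is asking for the second behaviour: the untouched `2018-01-01` partition survives and only `2018-01-02` is replaced.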

Overwrite only some partitions in a partitioned spark Dataset

2 Answers

Since Spark 2.3.0 this is an option when overwriting a table. To use it, set the new spark.sql.sources.partitionOverwriteMode setting to dynamic; the dataset needs to be partitioned, and the write mode must be overwrite. Example in Scala:

    spark.conf.set(
      "spark.sql.sources.partitionOverwriteMode", "dynamic"
    )
    data.write.mode("overwrite").insertInto("partitioned_table")

I recommend repartitioning by your partition column before writing, so you won't end up with something like 400 files per folder.
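A minimal PySpark sketch of that recommendation, assuming `spark` is an active SparkSession and `df` has the same column order as the target table (`insertInto` matches columns by position). The helper name `overwrite_changed_partitions` and the `day` column in the usage line are hypothetical.

```python
def overwrite_changed_partitions(spark, df, table_name, partition_col):
    """Overwrite only the partitions of table_name that appear in df."""
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    # Repartitioning by the partition column sends all rows with the same
    # partition value to the same task, so each folder gets few files.
    (df.repartition(partition_col)
       .write
       .insertInto(table_name, overwrite=True))


# Hypothetical usage:
# overwrite_changed_partitions(spark, data, "partitioned_table", "day")
```

If a single partition is very large, one file per folder may be too coarse; that trade-off is not covered here.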

Before Spark 2.3.0, the best solution was to run SQL statements to delete those partitions and then write them with mode append.
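The pre-2.3.0 workaround could look roughly like this in PySpark. It is a sketch under several assumptions: the table is registered in the metastore, dropping a partition also removes its data (true for managed tables), and partition values are simple strings that need no escaping. Both helper names are hypothetical.

```python
def drop_partition_statements(table_name, partition_col, values):
    """Build one ALTER TABLE ... DROP PARTITION statement per value."""
    return [
        f"ALTER TABLE {table_name} DROP IF EXISTS PARTITION ({partition_col}='{v}')"
        for v in values
    ]


def replace_partitions(spark, df, table_name, partition_col, values):
    """Delete the given partitions, then append the recomputed rows."""
    for stmt in drop_partition_statements(table_name, partition_col, values):
        spark.sql(stmt)
    # overwrite=False makes insertInto append instead of overwrite.
    df.write.insertInto(table_name, overwrite=False)


for stmt in drop_partition_statements(
    "partitioned_table", "day", ["2018-01-01", "2018-01-02"]
):
    print(stmt)
```

The delete and the append are two separate steps, so readers of the table can briefly see the partitions missing in between.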

answered Oct 07 '22 18:10

Madhava Carrillo

Just FYI, for PySpark users: make sure to set overwrite=True in insertInto, otherwise the mode is changed to append. From the source code:

    def insertInto(self, tableName, overwrite=False):
        self._jwrite.mode(
            "overwrite" if overwrite else "append"
        ).insertInto(tableName)
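To see why this matters, here is a small stand-in class that copies only the mode handling from the quoted snippet. `FakeWriter` is made up for illustration and is not part of PySpark; other PySpark versions may implement `insertInto` differently.

```python
class FakeWriter:
    """Stand-in that mimics only the mode handling of the quoted snippet."""

    def __init__(self):
        self.current_mode = None

    def mode(self, save_mode):
        self.current_mode = save_mode
        return self

    def insertInto(self, tableName, overwrite=False):
        # Same expression as in the quoted source: the flag decides the mode,
        # replacing whatever mode() was called with earlier.
        self.mode("overwrite" if overwrite else "append")
        return self.current_mode


# Calling mode("overwrite") first does not help; the flag wins.
print(FakeWriter().mode("overwrite").insertInto("partitioned_table"))  # append
print(FakeWriter().insertInto("partitioned_table", overwrite=True))    # overwrite
```

So with that implementation, translating the Scala call chain literally into Python would silently append.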

This is how to use it:

    spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC")
    data.write.insertInto("partitioned_table", overwrite=True)

The SQL version also works fine:

    INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement

For details, see the documentation for INSERT OVERWRITE.

answered Oct 07 '22 18:10

Ali Bey
