Overwrite a Parquet file with PySpark

I'm trying to overwrite a Parquet file stored in an S3 bucket using PySpark. The bucket has versioning enabled.

Here’s the code I’m using:

Initial Write (v1):

df_v1.repartition(1).write.parquet(path='s3a://bucket/file1.parquet')

Update and Overwrite (v2):

df_v1 = spark.read.parquet("s3a://bucket/file1.parquet")
df_v2 = df_v1...  # some transformation
df_v2.repartition(1).write.mode("overwrite").parquet('s3a://bucket/file1.parquet')

Issue:
After writing df_v2, when I read the data back, it contains rows from both df_v1 and df_v2. Additionally, I notice that:

  • After the first write, there is one part-*.snappy.parquet file.
  • After the second write (with overwrite mode), there are two such files.

It seems like the overwrite is not working as expected and behaves more like an append.
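
For reference, this is roughly how I am checking the result after the second write (the count is illustrative):

df_check = spark.read.parquet("s3a://bucket/file1.parquet")

# Returns rows from both df_v1 and df_v2, and the prefix now holds two part-*.snappy.parquet files
print(df_check.count())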

Environment:

  • Spark version: 2.4.4
  • Hadoop version: 2.7.3

Question:
Why is the overwrite mode not replacing the existing data in S3? Is this related to S3 versioning or something else I’m missing?

asked Nov 01 '25 by Phil


1 Answer

The issue you're facing is likely due to how Amazon S3 works under the hood.

S3 is not a traditional file system—it's a key-value object store. This means there are no real folders, just object keys that look like paths. For example, when you write to:

s3a://bucket/file1.parquet/

Spark creates objects with keys like:

s3a://bucket/file1.parquet/part-00000-xxxx.snappy.parquet

When you use .mode("overwrite"), Spark first deletes everything under the existing "directory" and then writes the new part files. Because S3 has no real directories, that cleanup is just a bulk delete of keys under the prefix, and with versioning enabled a delete only adds a delete marker rather than physically removing the object. If any of the old part files survive that cleanup, or the listing Spark works from is stale, the new files simply land next to the old ones under the same prefix, so a subsequent read returns both old and new data, which looks exactly like an append.
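
One way to work around this (a minimal sketch, assuming you can use boto3 alongside PySpark; the bucket and prefix names are taken from your example) is to clear the prefix explicitly before writing:

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("bucket")          # bucket name from your path
prefix = "file1.parquet/"             # the "directory" prefix Spark writes into

# On a versioning-enabled bucket this removes old versions and delete markers too;
# bucket.objects.filter(...).delete() would only hide objects behind delete markers.
bucket.object_versions.filter(Prefix=prefix).delete()

df_v2.repartition(1).write.mode("overwrite").parquet("s3a://bucket/file1.parquet")

Note that if df_v2 is built by reading from this same path (as in your snippet), persist it or write it to a temporary location first; otherwise the source data is gone before Spark actually computes df_v2.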

answered Nov 04 '25 by Steven


