I have a directory structure partitioned on two columns, like this:
People
  surname=Doe
    name=John
    name=Joe
  surname=White
    name=Josh
    name=Julien
I am reading Parquet files with information about the Does only, so I am specifying the surname=Doe directory directly as the output directory for my DataFrame. The problem is that I am now trying to add name-based partitioning with partitionBy("name") on write:
df.write.partitionBy("name").parquet(outputDir)
(outputDir contains the path to the Doe directory)
This causes an error like the one below:
Caused by: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
Partition column name list #0: surname, name
Partition column name list #1: surname
Any tips on how to solve this? It probably happens because of the _SUCCESS file created in the surname directory, which gives Spark the wrong hints about the partitioning - when I remove the _SUCCESS and _metadata files, Spark is able to read everything without any issue.
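For reference, a minimal sketch of deleting those marker files with the Hadoop FileSystem API (the path and the sc SparkContext are assumptions based on the layout above):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// The surname=Doe directory from the layout above (illustrative path).
val doeDir = new Path("/People/surname=Doe")
// _SUCCESS comes from the output committer, _metadata from the Parquet writer.
for (marker <- Seq("_SUCCESS", "_metadata")) {
  val p = new Path(doeDir, marker)
  if (fs.exists(p)) fs.delete(p, false) // non-recursive: these are plain files
}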
I have managed to solve it with a workaround - I don't think this is a good idea, but it works: I disabled the creation of the additional _SUCCESS and _metadata files with:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
That way Spark won't get any wrong ideas about the partitioning structure from the marker files.
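Putting the workaround together, a minimal sketch (assuming sc is the SparkContext behind the session that produced df, and outputDir points at the surname=Doe directory as in the question):

// Disable the marker files, then write name-partitioned data straight
// into the existing surname=Doe directory.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
df.write.partitionBy("name").parquet(outputDir)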
Another option is saving to the "proper" parent directory - People - and partitioning by both surname and name. But then you have to keep in mind that the only sane option is setting SaveMode to Append and manually deleting the directories you expect to be overwritten (which is really error-prone; see the sketch below):
df.write.mode(SaveMode.Append).partitionBy("surname","name").parquet("/People")
Do not use the Overwrite SaveMode in this case - it will delete ALL of the surname directories.
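A minimal sketch of the manual pre-delete followed by the append (the partition values are illustrative, and deletion goes through the Hadoop FileSystem API):

import org.apache.spark.sql.SaveMode
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// Delete only the partitions this batch is meant to replace.
for ((surname, name) <- Seq(("Doe", "John"), ("Doe", "Joe"))) {
  val dir = new Path(s"/People/surname=$surname/name=$name")
  if (fs.exists(dir)) fs.delete(dir, true) // recursive: partition directories
}
// Append recreates the deleted partitions and leaves the rest untouched.
df.write.mode(SaveMode.Append).partitionBy("surname", "name").parquet("/People")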
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
is fairly sensible, if you have summary metadata enabled then writing the metadata file can become an IO bottleneck on reads and writes.
An alternative to your solution might be to add .mode("append") to your write, but with the original parent directory as the destination and both partition columns listed, so the surname=... level is preserved (this assumes df still carries a surname column):
df.write.mode("append").partitionBy("surname", "name").parquet("/People")