 

Saving to parquet subpartition

I have a directory structure based on two partitions, like this:

  People
  > surname=Doe
        > name=John
        > name=Joe
  > surname=White
        > name=Josh
        > name=Julien

I am reading parquet files containing information only about the Does, so I am specifying surname=Doe directly as the output directory for my DataFrame. The problem appears when I try to add name-based partitioning with partitionBy("name") on write:

df.write.partitionBy("name").parquet(outputDir)

(outputDir contains a path to Doe directory)

This causes an error like below:

  Caused by: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
    Partition column name list #0: surname, name
    Partition column name list #1: surname

Any tips on how to solve it? It probably occurs because of the _SUCCESS file created in the surname directory, which gives Spark the wrong hints about the partition layout - when I remove the _SUCCESS and _metadata files, Spark is able to read everything without any issue.
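
For reference, a minimal sketch of the flow that seems to trigger it (the paths and variable names below are only illustrations of the layout above):

// Illustrative path to the Doe partition directory.
val doeDir = "/People/surname=Doe"

// Write only the Does, sub-partitioned by name, directly into surname=Doe.
// The job also drops _SUCCESS and _metadata files straight into that directory.
df.write.partitionBy("name").parquet(doeDir)

// Reading the whole tree afterwards is where partition discovery complains:
// the data files sit under surname=.../name=..., but the summary files sit one
// level higher, so Spark apparently infers two different partition column lists
// (the hypothesis described above).
val people = sqlContext.read.parquet("/People")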

asked Sep 29 '15 by TheMP

2 Answers

I have managed to solve it with a workaround - I don't think this is a good idea, but I disabled creating additional _SUCCESS and _metadata files with:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

That way Spark won't get any stupid ideas about the partitioning structures.
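
Putting the workaround together, the write then looks roughly like this (the path is only an example of a per-surname output directory):

// Suppress the extra marker/summary files before writing into the existing
// surname=Doe directory.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

df.write.partitionBy("name").parquet("/People/surname=Doe")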

Another option is saving to the "proper" directory - People - and partitioning by both surname and name. Keep in mind that the only sane option then is setting SaveMode to Append and manually deleting the directories you expect to be overwritten (which is really error-prone):

df.write.mode(SaveMode.Append).partitionBy("surname","name").parquet("/People")

Do not use the Overwrite SaveMode in this case - it would delete ALL of the surname directories.
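
A rough sketch of that manual cleanup, using the Hadoop FileSystem API to delete just the partitions the new data will replace (the path is illustrative, and this is exactly the error-prone part):

import org.apache.spark.sql.SaveMode
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)

// Delete only the partition directory the fresh data will replace,
// e.g. everything previously written for surname=Doe.
fs.delete(new Path("/People/surname=Doe"), true)  // recursive delete

// Then append, letting Spark recreate the partition directories.
df.write.mode(SaveMode.Append).partitionBy("surname", "name").parquet("/People")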

answered Sep 28 '22 by TheMP


sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

is fairly sensible; with summary metadata enabled, the summary file can become an IO bottleneck on both reads and writes.

An alternative to your solution might be to add .mode("append") to your write, but with the original parent directory as the destination:

df.write.mode("append").partitionBy("name").parquet("/People")
answered Sep 28 '22 by Ewan Leith