 

Saving to parquet subpartition

I have a directory structure based on two partitions, like this:

  People
  > surname=Doe
        > name=John
        > name=Joe
  > surname=White
        > name=Josh
        > name=Julien

I am reading parquet files containing information only about the Does, so I am specifying surname=Doe directly as the output directory for my DataFrame. The problem appears when I try to add name-based partitioning with partitionBy("name") on write:

df.write.partitionBy("name").parquet(outputDir)

(outputDir contains a path to Doe directory)

This causes an error like below:

  Caused by: java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
    Partition column name list #0: surname, name
    Partition column name list #1: surname

Any tips on how to solve it? It probably occurs because of the _SUCCESS file created in the surname directory, which gives Spark the wrong hints about the partition layout - when I remove the _SUCCESS and _metadata files, Spark is able to read everything without any issue.
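
For reference, a minimal sketch of the flow that seems to trigger it (the paths and variable names below are only illustrations of the layout above):

// Illustrative path to the Doe partition directory.
val doeDir = "/People/surname=Doe"

// Write only the Does, sub-partitioned by name, directly into surname=Doe.
// The job also drops _SUCCESS and _metadata files straight into that directory.
df.write.partitionBy("name").parquet(doeDir)

// Reading the whole tree afterwards is where partition discovery complains:
// the data files sit under surname=.../name=..., but the summary files sit one
// level higher, so Spark apparently infers two different partition column lists
// (the hypothesis described above).
val people = sqlContext.read.parquet("/People")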

asked Sep 29 '15 by TheMP

2 Answers

I have managed to solve it with a workaround - I don't think this is a good idea, but I disabled creating additional _SUCCESS and _metadata files with:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

That way Spark won't get any stupid ideas about the partitioning structures.
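
Putting the workaround together, the write then looks roughly like this (the path is only an example of a per-surname output directory):

// Suppress the extra marker/summary files before writing into the existing
// surname=Doe directory.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

df.write.partitionBy("name").parquet("/People/surname=Doe")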

Another option is saving to the "proper" directory - People - and partitioning by both surname and name. Keep in mind that the only sane option then is setting SaveMode to Append and manually deleting the directories you expect to be overwritten (which is really error-prone):

df.write.mode(SaveMode.Append).partitionBy("surname","name").parquet("/People")

Do not use the Overwrite SaveMode in this case - it would delete ALL of the surname directories.
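
A rough sketch of that manual cleanup, using the Hadoop FileSystem API to delete just the partitions the new data will replace (the path is illustrative, and this is exactly the error-prone part):

import org.apache.spark.sql.SaveMode
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)

// Delete only the partition directory the fresh data will replace,
// e.g. everything previously written for surname=Doe.
fs.delete(new Path("/People/surname=Doe"), true)  // recursive delete

// Then append, letting Spark recreate the partition directories.
df.write.mode(SaveMode.Append).partitionBy("surname", "name").parquet("/People")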

answered Sep 28 '22 by TheMP


sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

is fairly sensible; with summary metadata enabled, the summary file can become an IO bottleneck on both reads and writes.

An alternative to your solution might be to add .mode("append") to your write, but with the original parent directory as the destination:

df.write.mode("append").partitionBy("name").parquet("/People")
answered Sep 28 '22 by Ewan Leith