Append new data to partitioned parquet files

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks). The log files are CSV so I read them and apply a schema, then perform my transformations.

My problem is: how can I save each hour's data in Parquet format while appending to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.

Here is my save line:

data
    .filter(validPartnerIds($"partnerID"))
    .write
    .partitionBy("partnerID","year","month","day")
    .parquet(saveDestination)

The problem is that if the destination folder exists the save throws an error. If the destination doesn't exist then I am not appending my files.

I've tried using .mode("append"), but I find that Spark sometimes fails midway through, so I lose track of how much of my data has been written and how much I still need to write.

I am using parquet because the partitioning substantially speeds up my future queries. Also, I must write the data in some file format on disk and cannot use a database such as Druid or Cassandra.

Any suggestions for how to partition my dataframe and save the files (either sticking to parquet or another format) are greatly appreciated.

asked Jan 21 '16 22:01

Saman

2 Answers

If you need to append the files, you definitely have to use the append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory and I/O issues alike).
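As a minimal sketch of the append mode mentioned above, the question's save line could be wrapped like this. The helper name `appendHour` is hypothetical; `data` and `saveDestination` stand in for the DataFrame and path from the question, and the filter on `partnerID` is assumed to have been applied already.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object AppendSketch {
  // Hypothetical helper: appends one hour of data to the existing
  // partitioned Parquet data set instead of failing when the path exists.
  def appendHour(data: DataFrame, saveDestination: String): Unit =
    data.write
      .mode(SaveMode.Append) // equivalent to .mode("append")
      .partitionBy("partnerID", "year", "month", "day")
      .parquet(saveDestination)
}
```

Note that append mode by itself does not make a failed job resumable, so the concern raised in the question about partially written hours still applies.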

If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:

1) Use snappy by adding to the configuration:

conf.set("spark.sql.parquet.compression.codec", "snappy")

2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

The metadata files can be somewhat time-consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have had no issues.

If you generate many partitions (> 500), I'm afraid the best I can do is suggest that you look into a solution that does not use append mode. I simply never managed to get partitionBy to work with that many partitions.
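On Spark versions where SparkSession is the entry point, the two settings above might be applied as in the following sketch. The application name is a placeholder, and it is an assumption that both property names are still honoured by your Spark and Parquet versions.

```scala
import org.apache.spark.sql.SparkSession

object ParquetWriteSettings {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hourly-log-etl") // placeholder name, not from the question
      .getOrCreate()

    // 1) Use snappy compression for Parquet output.
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

    // 2) Skip generation of the Parquet summary metadata files.
    spark.sparkContext.hadoopConfiguration
      .set("parquet.enable.summary-metadata", "false")

    spark.stop()
  }
}
```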

answered Sep 24 '22 02:09

Glennie Helles Sindholt

If you're using unsorted partitioning, the rows for each output partition are spread across all of your in-memory partitions. That means every task will generate and write a file for each of your output partitions.

Consider repartitioning your data according to your partition columns before writing, so that all the data for each output partition ends up in the same task:

data
  .filter(validPartnerIds($"partnerID"))
  // Optionally pass a target partition count first, e.g. repartition(200, $"partnerID", ...)
  .repartition($"partnerID", $"year", $"month", $"day")
  .write
  .partitionBy("partnerID", "year", "month", "day")
  .parquet(saveDestination)

See: DataFrame.repartition
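To see why the repartition above helps, here is a rough back-of-the-envelope sketch of the number of files written per save. The task and directory counts are made-up numbers for illustration, and the model ignores settings that split large outputs into several files.

```scala
object FileCountSketch {
  // Upper bound when any task may hold rows for any output directory:
  // each task writes its own file into every directory it has rows for.
  def withoutRepartition(tasks: Int, outputDirs: Int): Long =
    tasks.toLong * outputDirs

  // After repartitioning by the partition columns, all rows for one
  // output directory sit in a single task, so each directory gets one file.
  def withRepartition(outputDirs: Int): Long = outputDirs.toLong

  def main(args: Array[String]): Unit = {
    val tasks = 200     // hypothetical number of write tasks
    val outputDirs = 50 // hypothetical partnerID/year/month/day combinations
    println(withoutRepartition(tasks, outputDirs)) // 10000
    println(withRepartition(outputDirs))           // 50
  }
}
```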

answered Sep 27 '22 02:09

MrChrisRodriguez


Tags: append, scala, apache-spark, parquet
