Spark dataframe write method writing many small files

Question

I've got a fairly simple job converting log files to parquet. It's processing 1.1TB of data (chunked into 64MB - 128MB files; our block size is 128MB), which is approximately 12 thousand files.

Job works as follows:

    import org.apache.spark.sql.SaveMode
    import spark.implicits._  // needed for .toDF()

    val df = spark.sparkContext
      .textFile(s"$stream/$sourcetype")
      .map(_.split(" \\|\\| ").toList)  // fields are separated by " || "
      .collect { case List(date, y, "Event") => MyEvent(date, y, "Event") }
      .toDF()

    df.write.mode(SaveMode.Append).partitionBy("date").parquet(s"$path")

It collects the events with a common schema, converts to a DataFrame, and then writes out as parquet.

The problem I'm having is that this can create a bit of an IO explosion on the HDFS cluster, as it's trying to create so many tiny files.

Ideally I want to create only a handful of parquet files within the partition 'date'.

What would be the best way to control this? Is it by using 'coalesce()'?

How will that affect the number of files created in a given partition? Is it dependent on how many executors I have working in Spark? (currently set at 100).

Raphael Roth · Accepted Answer

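You have to repartition your DataFrame to match the partitioning of the DataFrameWriter. Otherwise every task writes its own file into each date directory for which it holds rows, so the number of files per date grows with the number of tasks, not with the number of executors.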

Try this:

    df
      .repartition($"date")  // shuffle so that all rows of a date end up in the same task
      .write.mode(SaveMode.Append)
      .partitionBy("date")
      .parquet(s"$path")
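
To see the effect on a small scale, here is a minimal local sketch (not the asker's actual job). It assumes a local SparkSession, a made-up `MyEvent` case class with three string fields, synthetic data, and a temporary output directory; the helper `filesPerDate` and the `salt` column are hypothetical names introduced only for illustration. It writes the same data three times and counts the parquet files per `date=...` directory: once as-is, once after `repartition($"date")`, and once with a random salt column added to the repartition key, which is one possible way to get a handful of files per date instead of exactly one.

```scala
import java.io.File
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.rand

object SmallFilesSketch {
  // Hypothetical stand-in for the MyEvent class from the question.
  case class MyEvent(date: String, y: String, kind: String)

  // Counts the parquet part files inside each date=... directory.
  def filesPerDate(dir: File): Map[String, Int] =
    dir.listFiles()
      .filter(d => d.isDirectory && d.getName.startsWith("date="))
      .map(d => d.getName -> d.listFiles().count(_.getName.endsWith(".parquet")))
      .toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("small-files-sketch")
      .getOrCreate()
    import spark.implicits._

    val dates = Seq("2017-01-01", "2017-01-02", "2017-01-03")

    // 8 input partitions (tasks); every one of them contains rows for all 3 dates,
    // similar to log files that each span several days.
    val df = spark.sparkContext
      .parallelize(0 until 8000, numSlices = 8)
      .map(i => MyEvent(dates(i % 3), i.toString, "Event"))
      .toDF()

    val base = Files.createTempDirectory("small-files").toFile

    def write(data: DataFrame, name: String): File = {
      val out = new File(base, name)
      data.write.mode(SaveMode.Append).partitionBy("date").parquet(out.getPath)
      out
    }

    // No repartition: each task writes one file into every date directory it has rows for.
    println("plain:   " + filesPerDate(write(df, "plain")))

    // repartition by the partition column: all rows of a date are in a single task,
    // so each date directory receives one file.
    println("by date: " + filesPerDate(write(df.repartition($"date"), "by-date")))

    // Assumption: a random salt in 0..3 spreads each date over at most 4 tasks,
    // giving at most 4 files per date. The salt is dropped before writing.
    val salted = df
      .withColumn("salt", (rand() * 4).cast("int"))
      .repartition($"date", $"salt")
      .drop("salt")
    println("salted:  " + filesPerDate(write(salted, "salted")))

    spark.stop()
  }
}
```

With `repartition($"date")` alone, a single task has to write all rows of a date, which can be slow when one date holds a lot of data; the salted variant trades a few more files for more write parallelism. Note that `coalesce(n)` only reduces the number of tasks: each remaining task can still hold rows for many dates, so it bounds the files per date at `n` rather than aligning tasks with dates.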

Tags: scala, apache-spark

user3030878
