 

Spark: write to disk with fewer files than partitions

Can we write data to, say, 100 files, with 10 partitions going into each file?

I know we can use repartition or coalesce to reduce the number of partitions, but I have seen some Hadoop-generated Avro data with many more partitions than files.

Kenny asked Jan 08 '18

1 Answer

The number of files that get written out is controlled by the partitioning of your DataFrame or RDD: each partition produces one output file. So if your data is split across 10 Spark partitions, you cannot write fewer than 10 files without reducing the partitioning first (e.g. with coalesce or repartition).
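A minimal sketch of that relationship (assuming a spark session is in scope, as in spark-shell; the paths are purely illustrative):

// Each partition becomes one output file: 10 partitions -> 10 part files.
val ten = spark.range(100).repartition(10)
ten.write.mode("overwrite").parquet("/tmp/ten_files")

// Coalescing to 2 partitions before the write yields 2 part files.
ten.coalesce(2).write.mode("overwrite").parquet("/tmp/two_files")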

Now, having said that, when the data is read back in it can be split into smaller chunks based on your configured split size, depending on the format and/or compression. That is why Hadoop-generated Avro output can show more partitions than files when loaded.
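One way to observe this from the same session (a sketch, reusing the /tmp/ten_files output from above; spark.sql.files.maxPartitionBytes is the standard Spark SQL split-size setting for file-based sources):

// Shrink the max bytes per read partition so even small files can be
// split on read; the 1 MB value here is purely illustrative.
spark.conf.set("spark.sql.files.maxPartitionBytes", 1024L * 1024L)
val readBack = spark.read.parquet("/tmp/ten_files")
println(readBack.rdd.getNumPartitions) // may differ from the number of files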

If you instead want to increase the number of files written per Spark partition (e.g. to prevent files that are too large), Spark 2.2 introduced a maxRecordsPerFile option for writes. With it you can limit the number of records written per file in each partition. The other option, of course, would be to repartition.

The following will result in 2 files being written out even though the DataFrame has only 1 partition:

val df = spark.range(100).coalesce(1)  // 100 rows in a single partition
df.write.option("maxRecordsPerFile", 50).save("/tmp/foo")  // 2 files, 50 records each
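A quick way to confirm the file count from the same session (input_file_name is a standard Spark SQL function; the path matches the example above):

import org.apache.spark.sql.functions.input_file_name

// Each row carries the file it was read from; distinct names = files written.
val files = spark.read.load("/tmp/foo").select(input_file_name()).distinct()
println(files.count()) // 2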
Silvio answered Dec 01 '22