spark write to disk with N files less than N partitions

Question

Can we write data to say 100 files, with 10 partitions in each file?

I know we can use repartition or coalesce to reduce number of partition. But I have seen some hadoop generated avro data with much more partitions than number of files.

Silvio · Accepted Answer

The number of files that get written out is controlled by the parallelization of your DataFrame or RDD. So if your data is split across 10 Spark partitions you cannot write fewer than 10 files without reducing partitioning (e.g. coalesce or repartition).

Now, having said that when data is read back in it could be split into smaller chunks based on your configured split size but depending on format and/or compression.

If instead you want to increase the number of files written per Spark partition (e.g. to prevent files that are too large), Spark 2.2 introduces a maxRecordsPerFile option when you write data out. With this you can limit the number of records that get written per file in each partition. The other option of course would be to repartition.

The following will result in 2 files being written out even though it's only got 1 partition:

val df = spark.range(100).coalesce(1)
df.write.option("maxRecordsPerFile", 50).save("/tmp/foo")

spark write to disk with N files less than N partitions

Tags:

apache-spark

partition

Kenny

1 Answers

Silvio

Recent Activity

Donate For Us

spark write to disk with N files less than N partitions

Tags:

apache-spark

partition

Kenny

1 Answers

Silvio

Related questions

Recent Activity

Donate For Us