 

How can I control the number of output files written from Spark DataFrame?

I am using Spark Streaming to read JSON data from a Kafka topic.
I process the data with a DataFrame, and later I want to save the output to HDFS files. The problem is that using:

df.write.mode("append").format("text").save(path)

yields many files; some are large, and some are even 0 bytes.

Is there a way to control the number of output files? And to avoid the opposite problem: is there a way to limit the size of each file, so that a new file is started once the current one reaches a certain size or number of rows?

asked Mar 06 '23 by DigitalFailure

1 Answer

The number of output files is equal to the number of partitions of the Dataset. This means you can control it in a number of ways, depending on the context:

  • For Datasets without wide dependencies, you can control the number of input partitions using reader-specific parameters.
  • For Datasets with wide dependencies, you can control the number of partitions with the spark.sql.shuffle.partitions parameter.
  • Independently of the lineage, you can coalesce or repartition (see the sketch after this list).
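
A minimal sketch of all three levers, assuming a batch DataFrame (the paths, the appName, and the use of the json output format are placeholders; the asker's text format would additionally require a single string column):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ControlOutputFiles")  // placeholder app name
  // For Datasets with wide dependencies (joins, aggregations), this
  // controls the number of shuffle partitions, and therefore the
  // number of output files written after a shuffle.
  .config("spark.sql.shuffle.partitions", "8")
  .getOrCreate()

val df = spark.read.json("/path/to/input")  // placeholder input path

// repartition triggers a full shuffle and can increase or decrease
// the partition count; here it forces exactly 4 output files.
df.repartition(4)
  .write
  .mode("append")
  .format("json")
  .save("/path/to/output")

// coalesce only merges existing partitions (no full shuffle), so it
// can only reduce the count; here it forces a single output file.
df.coalesce(1)
  .write
  .mode("append")
  .format("json")
  .save("/path/to/output-single")

Since coalesce avoids a full shuffle, prefer it over repartition when you only need to reduce the partition count.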

is there a way to limit the size of each file, so that a new file is started once the current one reaches a certain size or number of rows?

No. With the built-in writers it is strictly a 1:1 relationship between partitions and output files.
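
As a quick sanity check (a sketch; the output path is a placeholder), the number of part-* files written matches the partition count reported by the Dataset:

// Number of partitions the writer will emit, one file per partition.
println(df.rdd.getNumPartitions)  // e.g. 4

df.write.mode("overwrite").format("json").save("/path/to/check")

// Listing /path/to/check in HDFS shows one part-* file per partition
// (plus a _SUCCESS marker), confirming the strict 1:1 relationship.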

answered Mar 10 '23 by user9898004