 

How can I control the number of output files written from Spark DataFrame?

I am using Spark Streaming to read JSON data from a Kafka topic.
I process the data with a DataFrame, and later I want to save the output to HDFS files. The problem is that using:

df.write.mode("append").format("text").save(path)

yields many files; some are large, and some are even 0 bytes.

Is there a way to control the number of output files? And to avoid the opposite problem: is there a way to limit the size of each file, so that a new file is started once the current one reaches a certain size or number of rows?

asked Mar 06 '23 by DigitalFailure

1 Answer

The number of output files is equal to the number of partitions of the Dataset. This means you can control it in a number of ways, depending on the context:

  • For Datasets without wide dependencies, you can control the number of input partitions using reader-specific parameters.
  • For Datasets with wide dependencies, you can control the number of partitions with the spark.sql.shuffle.partitions parameter.
  • Independently of the lineage, you can coalesce or repartition (see the sketch after this list).
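
A minimal sketch of all three levers, assuming a batch DataFrame (the paths, the appName, and the use of the json output format are placeholders; the asker's text format would additionally require a single string column):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ControlOutputFiles")  // placeholder app name
  // For Datasets with wide dependencies (joins, aggregations), this
  // controls the number of shuffle partitions, and therefore the
  // number of output files written after a shuffle.
  .config("spark.sql.shuffle.partitions", "8")
  .getOrCreate()

val df = spark.read.json("/path/to/input")  // placeholder input path

// repartition triggers a full shuffle and can increase or decrease
// the partition count; here it forces exactly 4 output files.
df.repartition(4)
  .write
  .mode("append")
  .format("json")
  .save("/path/to/output")

// coalesce only merges existing partitions (no full shuffle), so it
// can only reduce the count; here it forces a single output file.
df.coalesce(1)
  .write
  .mode("append")
  .format("json")
  .save("/path/to/output-single")

Since coalesce avoids a full shuffle, prefer it over repartition when you only need to reduce the partition count.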

is there a way to limit the size of each file, so that a new file is started once the current one reaches a certain size or number of rows?

No. With the built-in writers it is strictly a 1:1 relationship between partitions and output files.
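
As a quick sanity check (a sketch; the output path is a placeholder), the number of part-* files written matches the partition count reported by the Dataset:

// Number of partitions the writer will emit, one file per partition.
println(df.rdd.getNumPartitions)  // e.g. 4

df.write.mode("overwrite").format("json").save("/path/to/check")

// Listing /path/to/check in HDFS shows one part-* file per partition
// (plus a _SUCCESS marker), confirming the strict 1:1 relationship.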

answered Mar 10 '23 by user9898004