I am using Spark Streaming to read JSON data from a Kafka topic. I process the data with DataFrames, and later I want to save the output to HDFS files. The problem is that writing with:
df.write.mode("append").format("text").save(path)  # path = HDFS output directory
yields many files, some large and some even 0 bytes.
Is there a way to control the number of output files? Also, to avoid the opposite problem, is there a way to limit the size of each file, so that a new file is started once the current one reaches a certain size or number of rows?
The number of output files is equal to the number of partitions of the Dataset being written. This means you can control it in a number of ways, depending on the context:

- For Datasets with no wide dependencies, you can control the input partitioning using reader-specific parameters.
- For Datasets with wide dependencies, you can control the number of partitions with the spark.sql.shuffle.partitions parameter.
- Independently of either, you can call coalesce or repartition before writing (see the sketch below).

"is there a way to also limit the size of each file so a new file will be written to when the current reaches a certain size/num of rows?"

No. With the built-in writers it is strictly a 1:1 relationship between partitions and output files.
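For illustration, here is a minimal PySpark sketch of controlling the output file count before writing. The SparkSession setup, the input/output HDFS paths, and the target partition count of 4 are assumptions for the example, not values from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("control-output-files").getOrCreate()

# Hypothetical input path; in the question the data comes from Kafka.
df = spark.read.json("hdfs:///data/input")

# coalesce(4) merges the existing partitions down to 4 without a full
# shuffle, so the write below produces exactly 4 output files.
# repartition(4) would achieve the same count but redistributes rows
# evenly across partitions at the cost of a shuffle.
df.coalesce(4).write.mode("append").format("json").save("hdfs:///data/output")

# For Datasets with wide dependencies (joins, aggregations), the number of
# post-shuffle partitions is governed by this configuration instead:
spark.conf.set("spark.sql.shuffle.partitions", "50")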