I am currently using Spark with Scala on my laptop.
When I write an RDD to a file, the output is written to two files "part-00000" and "part-00001". How can I force Spark / Scala to write to one file?
My code is currently:
myRDD.map(x => x._1 + "," + x._2).saveAsTextFile("/path/to/output")
where the map removes the tuple parentheses so the output is written as key,value pairs.
The "problem" is indeed a feature, and it is produced by how your RDD
is partitioned, hence it is separated in n
parts where n
is the number of partitions. To fix this you just need to change the number of partitions to one, by using repartition on your RDD
. The documentation states:
repartition(numPartitions)
Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
For example, this change should work:
myRDD.map(x => x._1 + "," + x._2).repartition(1).saveAsTextFile("/path/to/output")
As the documentation says, you can also use coalesce, which is the recommended option when reducing the number of partitions, since it can avoid a shuffle. Keep in mind, though, that collapsing to a single partition is generally a bad idea either way: all of the data must be moved to one node, and you lose any parallelism in the final stage.
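For completeness, here is the same pipeline using coalesce instead of repartition; this is a minimal sketch reusing the myRDD name and the output path from the question:

myRDD.map(x => x._1 + "," + x._2).coalesce(1).saveAsTextFile("/path/to/output")

Note that because coalesce(1) avoids a shuffle by default, Spark may collapse the preceding map into that same single task, so repartition(1) can actually perform better when the upstream stages are expensive and benefit from running in parallel first.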