Spark saveAsTextFile() writes to multiple files instead of one [duplicate]

I am currently using Spark and Scala on my laptop.

When I write an RDD to a file, the output is written to two files "part-00000" and "part-00001". How can I force Spark / Scala to write to one file?

My code is currently:

myRDD.map(x => x._1 + "," + x._2).saveAsTextFile("/path/to/output")

where I concatenate the key and value so each line is written as key,value rather than the default (key,value) tuple format.
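For reference, here is a minimal self-contained sketch that reproduces the behavior (the PartFilesDemo object name, the local[2] master, and the sample data are illustrative, not my actual job):

import org.apache.spark.{SparkConf, SparkContext}

object PartFilesDemo {
  def main(args: Array[String]): Unit = {
    // local[2] gives the RDD two default partitions, so
    // saveAsTextFile emits part-00000 and part-00001
    val conf = new SparkConf().setAppName("PartFilesDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val myRDD = sc.parallelize(Seq(("a", 1), ("b", 2)))
    myRDD.map(x => x._1 + "," + x._2).saveAsTextFile("/path/to/output")

    sc.stop()
  }
}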

asked Feb 17, 2016 by stackoverflowuser2010

1 Answer

The "problem" is in fact expected behavior, and it comes from how your RDD is partitioned: the output is split into n part files, where n is the number of partitions. To get a single file, change the number of partitions to one by calling repartition on your RDD. The documentation states:

repartition(numPartitions)

Return a new RDD that has exactly numPartitions partitions.

Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.

For example, this change should work:

myRDD.map(x => x._1 + "," + x._2).repartition(1).saveAsTextFile("/path/to/output")

As the documentation says, you can also use coalesce, which is the recommended option when you are reducing the number of partitions. Keep in mind, however, that collapsing to a single partition is generally a bad idea for large datasets: all of the data must be moved to a single node, and the write loses all parallelism.
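For completeness, the coalesce version of the example above would be:

myRDD.map(x => x._1 + "," + x._2).coalesce(1).saveAsTextFile("/path/to/output")

coalesce(1) avoids the full shuffle that repartition(1) triggers, but every record still has to pass through a single task when the file is written.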

answered Sep 27, 2022 by Alberto Bonsanto