I use the following code:
csv.saveAsTextFile(pathToResults, classOf[GzipCodec])
The pathToResults directory contains many files like part-0000, part-0001, etc. I can use FileUtil.copyMerge(), but it's really slow: it downloads all the files to the driver program and then uploads them back to Hadoop. Still, FileUtil.copyMerge() is faster than:
csv.repartition(1).saveAsTextFile(pathToResults, classOf[GzipCodec])
How can I merge Spark result files without repartition() and FileUtil.copyMerge()?
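For reference, the copyMerge call I'm comparing against looks roughly like this (a sketch; pathToMerged is a placeholder destination, and note that copyMerge only exists up to Hadoop 2.x, it was removed in Hadoop 3):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
// Concatenates every file under pathToResults into one destination file.
// deleteSource = false keeps the original part files; the trailing null
// means no separator string is inserted between the files.
FileUtil.copyMerge(fs, new Path(pathToResults),
                   fs, new Path(pathToMerged),
                   false, hadoopConf, null)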
Unfortunately, there is no other option to get a single output file in Spark. Instead of repartition(1) you can use coalesce(1), but with parameter 1 their behavior is the same: Spark would collect your data into a single partition in memory, which might cause an OOM error if your data is too big.
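For completeness, here is what that looks like with coalesce, reusing the csv RDD and pathToResults from the question:

import org.apache.hadoop.io.compress.GzipCodec

// With a target of 1 partition, coalesce funnels every record through a
// single task, just like repartition(1) (which is coalesce(1, shuffle = true)
// under the hood), so you still get one gzipped part file.
csv.coalesce(1).saveAsTextFile(pathToResults, classOf[GzipCodec])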
Another option for merging files on HDFS would be to write a simple MapReduce job (or a Pig job, or a Hadoop Streaming job) that takes the whole directory as input and, using a single reducer, generates a single output file. But be aware that with the MapReduce approach all the data would first be copied to the reducer's local filesystem, which might cause an "out of space" error.
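A minimal sketch of that MapReduce approach in Scala, assuming the input is the line-oriented gzip output from the question (MergeJob and MergeReducer are hypothetical names; note that the shuffle sorts lines by byte offset, so the original line order across part files is not preserved):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapreduce.{Job, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Drops the byte-offset key produced by TextInputFormat and writes each
// line once; NullWritable keys make TextOutputFormat emit the bare line.
class MergeReducer extends Reducer[LongWritable, Text, NullWritable, Text] {
  override def reduce(key: LongWritable, values: java.lang.Iterable[Text],
      context: Reducer[LongWritable, Text, NullWritable, Text]#Context): Unit = {
    val it = values.iterator()
    while (it.hasNext) context.write(NullWritable.get(), it.next())
  }
}

object MergeJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "merge-spark-output")
    job.setJarByClass(classOf[MergeReducer])
    job.setNumReduceTasks(1)                 // single reducer => single part file
    // The default identity Mapper passes TextInputFormat's (offset, line) pairs through.
    job.setMapOutputKeyClass(classOf[LongWritable])
    job.setMapOutputValueClass(classOf[Text])
    job.setReducerClass(classOf[MergeReducer])
    job.setOutputKeyClass(classOf[NullWritable])
    job.setOutputValueClass(classOf[Text])
    // Keep the output gzipped, matching the Spark job in the question.
    FileOutputFormat.setCompressOutput(job, true)
    FileOutputFormat.setOutputCompressorClass(job, classOf[GzipCodec])
    FileInputFormat.addInputPath(job, new Path(args(0)))   // e.g. pathToResults
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // merged output dir
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}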