How can I merge Spark result files without repartition and copyMerge?

I use the following code:

csv.saveAsTextFile(pathToResults, classOf[GzipCodec])

The pathToResults directory contains many files like part-0000, part-0001, etc. I can use FileUtil.copyMerge(), but it's really slow: it downloads all the files to the driver program and then uploads them back to Hadoop. Still, FileUtil.copyMerge() is faster than:

csv.repartition(1).saveAsTextFile(pathToResults, classOf[GzipCodec])

How can I merge Spark result files without repartition and FileUtil.copyMerge()?

Leonard asked Mar 13 '15
1 Answer

Unfortunately, there is no other option to get a single output file in Spark. Instead of repartition(1) you can use coalesce(1), but with parameter 1 their behavior is the same: Spark collects your data into a single partition in memory, which might cause an OOM error if your data is too big.
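As a sketch of that single-partition write (the input and output paths below are hypothetical, and csv is assumed to be an RDD[String] as in the question):

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("merge-sketch"))

// Hypothetical input path standing in for the question's csv RDD.
val csv = sc.textFile("hdfs:///input/data")

// coalesce(1) avoids the full shuffle that repartition(1) triggers, but it
// still funnels every record through a single task on one executor, so the
// OOM risk for large datasets is the same.
csv.coalesce(1).saveAsTextFile("hdfs:///output/merged", classOf[GzipCodec])
```

Because the write happens in a single task, the job loses all parallelism for that stage; this is only practical when the final data comfortably fits in one executor's memory.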

Another option for merging files on HDFS might be to write a simple MapReduce job (or a Pig job, or a Hadoop Streaming job) that takes the whole directory as input and, using a single reducer, generates a single output file. But be aware that with the MapReduce approach all the data is first copied to the reducer's local filesystem, which might cause an "out of space" error.
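A minimal Hadoop Streaming sketch of that single-reducer merge (all paths here are hypothetical; /bin/cat simply passes records through unchanged):

```shell
# Merge all part files in one directory into a single output file
# by forcing the job to use exactly one reducer.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.job.reduces=1 \
  -input  /results/pathToResults \
  -output /results/merged \
  -mapper  /bin/cat \
  -reducer /bin/cat
```

Note that the shuffle phase sorts records before they reach the reducer, so the original line order of the part files is not preserved; for plain CSV rows without a meaningful key this is usually acceptable, but it is worth knowing.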

Here are some useful links on the same topic:

  • merge output files after reduce phase
  • Merging hdfs files
  • Merging multiple files into one within Hadoop
0x0FFF answered Oct 23 '22