I use the following code:
csv.saveAsTextFile(pathToResults, classOf[GzipCodec])
The pathToResults directory contains many files like part-0000, part-0001, etc. I can use FileUtil.copyMerge(), but it's really slow: it downloads all the files to the driver program and then uploads them back to Hadoop. Still, FileUtil.copyMerge() is faster than:
csv.repartition(1).saveAsTextFile(pathToResults, classOf[GzipCodec])
How can I merge Spark result files without repartition() and FileUtil.copyMerge()?
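For reference, the copyMerge call I'm comparing against looks roughly like this (a sketch; pathToMerged is a placeholder destination, and note that copyMerge only exists up to Hadoop 2.x, it was removed in Hadoop 3):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
// Concatenates every file under pathToResults into one destination file.
// deleteSource = false keeps the original part files; the trailing null
// means no separator string is inserted between the files.
FileUtil.copyMerge(fs, new Path(pathToResults),
                   fs, new Path(pathToMerged),
                   false, hadoopConf, null)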
Unfortunately, there is no other option to get a single output file in Spark. Instead of repartition(1) you can use coalesce(1), but with parameter 1 their behavior is the same: Spark would collect your data into a single partition in memory, which might cause an OOM error if your data is too big.
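For completeness, here is what that looks like with coalesce, reusing the csv RDD and pathToResults from the question:

import org.apache.hadoop.io.compress.GzipCodec

// With a target of 1 partition, coalesce funnels every record through a
// single task, just like repartition(1) (which is coalesce(1, shuffle = true)
// under the hood), so you still get one gzipped part file.
csv.coalesce(1).saveAsTextFile(pathToResults, classOf[GzipCodec])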
Another option for merging files on HDFS would be to write a simple MapReduce job (or a Pig job, or a Hadoop Streaming job) that takes the whole directory as input and, using a single reducer, generates a single output file. But be aware that with the MapReduce approach all the data would first be copied to the reducer's local filesystem, which might cause an "out of space" error.
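A minimal sketch of that MapReduce approach in Scala, assuming the input is the line-oriented gzip output from the question (MergeJob and MergeReducer are hypothetical names; note that the shuffle sorts lines by byte offset, so the original line order across part files is not preserved):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, NullWritable, Text}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapreduce.{Job, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Drops the byte-offset key produced by TextInputFormat and writes each
// line once; NullWritable keys make TextOutputFormat emit the bare line.
class MergeReducer extends Reducer[LongWritable, Text, NullWritable, Text] {
  override def reduce(key: LongWritable, values: java.lang.Iterable[Text],
      context: Reducer[LongWritable, Text, NullWritable, Text]#Context): Unit = {
    val it = values.iterator()
    while (it.hasNext) context.write(NullWritable.get(), it.next())
  }
}

object MergeJob {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "merge-spark-output")
    job.setJarByClass(classOf[MergeReducer])
    job.setNumReduceTasks(1)                 // single reducer => single part file
    // The default identity Mapper passes TextInputFormat's (offset, line) pairs through.
    job.setMapOutputKeyClass(classOf[LongWritable])
    job.setMapOutputValueClass(classOf[Text])
    job.setReducerClass(classOf[MergeReducer])
    job.setOutputKeyClass(classOf[NullWritable])
    job.setOutputValueClass(classOf[Text])
    // Keep the output gzipped, matching the Spark job in the question.
    FileOutputFormat.setCompressOutput(job, true)
    FileOutputFormat.setOutputCompressorClass(job, classOf[GzipCodec])
    FileInputFormat.addInputPath(job, new Path(args(0)))   // e.g. pathToResults
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // merged output dir
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}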