
Hadoop: How can I merge reducer outputs into a single file? [duplicate]

I know that the -getmerge shell command can do this.

But what should I do if I want to merge these outputs after the job using the HDFS Java API?

What I actually want is a single merged file on HDFS.

The only thing I can think of is to start an additional job afterwards.

Thanks!

asked Oct 16 '12 by thomaslee

1 Answer

But what should I do if I want to merge these outputs after the job using the HDFS Java API?

Guessing, because I haven't tried this myself, but I think the method you are looking for is FileUtil.copyMerge, which is what FsShell invokes when you run the -getmerge command. FileUtil.copyMerge takes two FileSystem objects as arguments: FsShell uses FileSystem.getLocal to retrieve the destination FileSystem, but I don't see any reason you couldn't instead call Path.getFileSystem on the destination path, so the merged file is written to HDFS rather than to local disk.
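A minimal sketch of that idea (the paths here are hypothetical, and note that FileUtil.copyMerge was removed in Hadoop 3.0, so this applies to Hadoop 1.x/2.x):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeOutputs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hypothetical paths: the reducer output directory and the merged target file
        Path srcDir = new Path("/user/hadoop/job-output");
        Path dstFile = new Path("/user/hadoop/merged.txt");

        // Resolving both FileSystems from the paths themselves means the
        // destination can be on HDFS instead of the local filesystem
        FileSystem srcFs = srcDir.getFileSystem(conf);
        FileSystem dstFs = dstFile.getFileSystem(conf);

        // Concatenates every file under srcDir into dstFile.
        // deleteSource=false keeps the part-* files; addString=null inserts
        // no separator between the concatenated files.
        FileUtil.copyMerge(srcFs, srcDir, dstFs, dstFile, false, conf, null);
    }
}
```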

That said, I don't think it wins you very much: the merge still happens in the local JVM, so you aren't really saving much over -getmerge followed by -put.
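For comparison, that two-step shell equivalent looks like this (paths hypothetical):

```shell
# Pull all part-* files down to the client machine, concatenated into one local file
hadoop fs -getmerge /user/hadoop/job-output merged.txt

# Push the merged file back up to HDFS
hadoop fs -put merged.txt /user/hadoop/merged.txt
```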

answered Oct 14 '22 by VoiceOfUnreason