Merging multiple files into one within Hadoop

I get multiple small files in my input directory which I want to merge into a single file, without using the local file system or writing MapReduce jobs. Is there a way I could do this using hadoop fs commands or Pig?

Thanks!

asked Aug 23 '10 by uHadoop


2 Answers

To keep everything on the grid, use Hadoop streaming with a single reducer and cat as both the mapper and the reducer (basically a no-op); compression can be added via MapReduce flags.

hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat

If you want compression, add:

-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
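
For example, a complete invocation combining the two snippets above might look like this (a sketch: the streaming jar path and queue name vary by installation, and $INPUT/$OUTPUT are placeholders for your HDFS paths):

# Each mapper streams its input split through cat unchanged; the single
# reducer (also cat) concatenates everything into one output file, and the
# output is gzip-compressed on the way out. Note -D options must come first.
hadoop jar \
    $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.job.queue.name=$QUEUE \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input "$INPUT" \
    -output "$OUTPUT" \
    -mapper cat \
    -reducer cat

The merged, compressed result ends up on HDFS as a single part file under $OUTPUT.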

answered Sep 17 '22 by Guy B


hadoop fs -getmerge <dir_of_input_files> <mergedsinglefile> 
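
Note that -getmerge writes the concatenated result to the local filesystem, so if the merged file needs to end up back on HDFS you can push it up again with -put. A sketch with hypothetical paths:

# concatenate all files under the HDFS input directory into one local file
hadoop fs -getmerge /user/me/input /tmp/merged.txt
# optionally copy the merged file back to HDFS
hadoop fs -put /tmp/merged.txt /user/me/merged.txt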
answered Sep 19 '22 by Harsha Hulageri