 

Merging hdfs files

Tags:

hadoop

hdfs

I have 1000+ files in HDFS with a naming convention of 1_fileName.txt to N_fileName.txt. The size of each file is 1024 MB. I need to merge these files into one file (on HDFS) while keeping the order of the files; say, 5_FileName.txt should be appended only after 4_fileName.txt.

What is the best and fastest way to perform this operation?

Is there any method to perform this merging without copying the actual data between data nodes? For example: get the block locations of these files and create a new entry (FileName) in the Namenode with these block locations?

JoRoot asked Feb 12 '13




3 Answers

There is no efficient way of doing this; you'll need to pull all the data through one node, then write it back to HDFS.

A command line scriptlet to do this could be as follows:

hadoop fs -text *_fileName.txt | hadoop fs -put - targetFilename.txt

This will cat all files that match the glob to standard output, then pipe that stream to the put command, which writes the stream to an HDFS file named targetFilename.txt.

The only problem is the filename structure you have chosen: if the number part were fixed-width and zero-padded it would be easier, but in its current state you'll get an unexpected lexicographic order (1, 10, 100, 1000, 11, 110, etc.) rather than numeric order (1, 2, 3, 4, etc.). You could work around this by amending the scriptlet to:

hadoop fs -text [0-9]_fileName.txt [0-9][0-9]_fileName.txt \
    [0-9][0-9][0-9]_fileName.txt [0-9][0-9][0-9][0-9]_fileName.txt \
    | hadoop fs -put - targetFilename.txt
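Another way around the ordering problem, since the files are numbered 1..N, is simply to generate the paths in numeric order yourself instead of relying on glob expansion. This is a sketch only: the /data prefix and the count are placeholder assumptions, and the hadoop pipeline needs a live cluster.

```shell
# Emit the HDFS paths in strict numeric order (1..N).
# The /data prefix and N are assumptions; adjust to the real layout.
ordered_paths() {
  n=$1
  i=1
  while [ "$i" -le "$n" ]; do
    echo "/data/${i}_fileName.txt"
    i=$((i+1))
  done
}

# Then stream each file in order and write the concatenation back to HDFS
# (shown here as a comment since it requires a running cluster):
# ordered_paths 1000 | while read -r f; do hadoop fs -text "$f"; done \
#   | hadoop fs -put - /data/targetFilename.txt
```

Because the loop counts upward, 5_fileName.txt can never precede 4_fileName.txt, regardless of how many digits the numbers have.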
Chris White answered Oct 16 '22


There is an API method org.apache.hadoop.fs.FileUtil.copyMerge that performs this operation:

public static boolean copyMerge(
                    FileSystem srcFS,
                    Path srcDir,
                    FileSystem dstFS,
                    Path dstFile,
                    boolean deleteSource,
                    Configuration conf,
                    String addString)

It reads all files in srcDir in alphabetical order and appends their content to dstFile. Note that alphabetical order suffers from the same lexicographic-ordering problem described above, and that copyMerge was removed in Hadoop 3.0. The equivalent shell command is hadoop fs -getmerge <srcDir> <localDst>, which merges to the local filesystem rather than to HDFS.

Dmitry answered Oct 16 '22


If you can use Spark, it can be done like this:

sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")

Hope this works. Since Spark works in a distributed fashion, you won't have to copy the files onto one node. Though just a caution, coalescing files in Spark can be slow if the files are very large.

user2200660 answered Oct 16 '22