Merging HDFS files

Tags:

hadoop

hdfs

I have 1000+ files in HDFS, named 1_fileName.txt through N_fileName.txt. Each file is 1024 MB. I need to merge them into a single HDFS file while preserving their order; for example, 5_fileName.txt should be appended only after 4_fileName.txt.

What is the best and fastest way to perform this operation?

Is there any way to perform this merge without copying the actual data between data nodes? For example: get the block locations of these files and create a new entry (file name) in the Namenode that points to these block locations?

797

asked Feb 12 '13 11:02

JoRoot

3 Answers

There is no efficient way of doing this: you'll need to move all the data to one node and then back to HDFS.

A command line scriptlet to do this could be as follows:

hadoop fs -text *_fileName.txt | hadoop fs -put - targetFilename.txt

This will cat all files that match the glob to standard output; that stream is then piped to the put command, which writes it to an HDFS file named targetFilename.txt.

The only problem is the filename structure you have chosen. If the number part were fixed-width and zero-padded it would be easier, but in its current state you'll get lexicographic order (1, 10, 100, 1000, 11, 110, etc.) rather than numeric order (1, 2, 3, 4, etc.). You could work around this by listing one glob per digit count, shortest first:

hadoop fs -text [0-9]_fileName.txt [0-9][0-9]_fileName.txt \
    [0-9][0-9][0-9]_fileName.txt [0-9][0-9][0-9][0-9]_fileName.txt \
    | hadoop fs -put - targetFilename.txt
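An alternative to writing one glob per digit count is to list the files and sort them by their numeric prefix before streaming them. The following is only a sketch under stated assumptions: the helper names numeric_order and merge_in_order are made up for illustration, the inputs are plain text, all files sit in one HDFS directory, and no path contains whitespace.

```shell
#!/usr/bin/env bash

# numeric_order (hypothetical helper): reads one file name per line on stdin
# and prints the names sorted by the number before the first underscore.
numeric_order() {
    sort -t_ -k1,1n
}

# merge_in_order (hypothetical helper): lists an HDFS directory, keeps the
# *_fileName.txt entries, sorts them numerically, and streams them in that
# order into a single HDFS target file.
merge_in_order() {
    local src_dir=$1 target=$2
    hadoop fs -ls "$src_dir" \
        | awk '{print $NF}' \
        | grep '_fileName\.txt$' \
        | sed 's|.*/||' \
        | numeric_order \
        | sed "s|^|$src_dir/|" \
        | xargs hadoop fs -text \
        | hadoop fs -put - "$target"
}
```

It could be invoked as merge_in_order /some/hdfs/dir targetFilename.txt, where the directory is a placeholder. The data still flows through the node running the script, exactly as with the one-line pipeline; only the ordering changes.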

175

answered Oct 16 '22 14:10

Chris White

There is an API method org.apache.hadoop.fs.FileUtil.copyMerge that performs this operation:

public static boolean copyMerge(
                    FileSystem srcFS,
                    Path srcDir,
                    FileSystem dstFS,
                    Path dstFile,
                    boolean deleteSource,
                    Configuration conf,
                    String addString)

It reads all files in srcDir in alphabetical order and appends their content to dstFile.
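Because the files are read in alphabetical order, names such as 10_fileName.txt would come before 2_fileName.txt. One possible workaround is to zero-pad the numeric prefix first, so that alphabetical and numeric order coincide. The sketch below is an assumption-laden illustration, not part of the API: padded_name and rename_padded are hypothetical helpers, the prefix is assumed to be a plain decimal number without leading zeros, and paths are assumed to contain no whitespace. Within HDFS, hadoop fs -mv is a rename, so this step should not move block data.

```shell
#!/usr/bin/env bash

# padded_name (hypothetical helper): turns "5_fileName.txt" into
# "0005_fileName.txt". The optional second argument is the pad width
# (default 4). Assumes the prefix is a decimal number without leading zeros.
padded_name() {
    local name=$1 width=${2:-4}
    local num=${name%%_*}    # text before the first underscore
    local rest=${name#*_}    # text after the first underscore
    printf "%0${width}d_%s\n" "$num" "$rest"
}

# rename_padded (hypothetical helper): renames every *_fileName.txt entry in
# an HDFS directory to its zero-padded form, skipping names already padded.
rename_padded() {
    local src_dir=$1
    hadoop fs -ls "$src_dir" | awk '{print $NF}' | grep '_fileName\.txt$' |
    while read -r path; do
        base=${path##*/}
        new=$(padded_name "$base")
        [ "$base" = "$new" ] || hadoop fs -mv "$path" "$src_dir/$new"
    done
}
```

After a rename like this, an alphabetical listing of the directory yields 0001_fileName.txt, 0002_fileName.txt, and so on, which matches the order the question asks for.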

answered Oct 16 '22 12:10

Dmitry

If you can use Spark, it can be done like this:

sc.textFile("hdfs://...../part*").coalesce(1).saveAsTextFile("hdfs://...../filename")

This should work. Since Spark runs in a distributed fashion, you won't have to copy the files onto one node yourself. One caution, though: coalescing files in Spark can be slow if the files are very large.

answered Oct 16 '22 13:10

user2200660
