
How does Hadoop -getmerge work?

In the Hadoop documentation, the getmerge command is described as:

Usage: hdfs dfs -getmerge src localdst [addnl]

My question is: why does getmerge concatenate to the local destination rather than within HDFS itself? I ask because of the following problems:

  1. What if the total size of the files to be merged exceeds the space available on the local file system?
  2. Is there any specific reason for restricting the hadoop -getmerge command to concatenating only to a local destination?
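For illustration, a typical invocation might look like the following. The paths here are hypothetical examples, not from the original post:

```shell
# Merge all files in an HDFS directory (e.g. the part-* output of a
# MapReduce job) into one file on the LOCAL file system.
hdfs dfs -getmerge /user/alice/job-output ./result.txt

# With the optional [addnl] argument, a newline is appended after
# the contents of each merged file.
hdfs dfs -getmerge /user/alice/job-output ./result.txt addnl
```

Note that the destination is always a local path; this is exactly the restriction the question is about.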
monic asked Apr 15 '16 06:04




1 Answer

The getmerge command was created specifically for merging files from HDFS into a single file on the local file system.

This command is very useful for downloading the output of a MapReduce job, which could have generated multiple part-* files, and combining them into a single file locally, which you can then use for other operations (e.g., load it into a spreadsheet for presentation).

Answers to your questions:

  1. If the destination file system does not have enough space, an IOException is thrown. getmerge internally uses the IOUtils.copyBytes() function to copy one file at a time from HDFS to the local file system; this function throws an IOException whenever the copy operation fails.
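A rough sketch of that mechanism (not the actual getmerge source code; the paths are hypothetical) would be a loop that copies each HDFS file into a single local output stream via IOUtils.copyBytes(). An IOException from any one copy, e.g. when the local disk fills up, aborts the merge:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class GetMergeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);

        Path srcDir = new Path("/user/alice/job-output");  // hypothetical HDFS dir
        Path dst = new Path("file:///tmp/merged.txt");     // hypothetical local file

        try (FSDataOutputStream out = local.create(dst)) {
            for (FileStatus st : hdfs.listStatus(srcDir)) {
                if (st.isFile()) {
                    try (FSDataInputStream in = hdfs.open(st.getPath())) {
                        // copyBytes throws IOException if the copy fails,
                        // e.g. when the local file system runs out of space;
                        // close=false keeps the shared output stream open.
                        IOUtils.copyBytes(in, out, 4096, false);
                    }
                }
            }
        }
    }
}
```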

  2. This command works along the same lines as the hdfs dfs -get command, which fetches a file from HDFS to the local file system. The only difference is that hdfs dfs -getmerge merges multiple files from HDFS into a single local file.

If you want to merge multiple files within HDFS, you can achieve it with the copyMerge() method of the FileUtil class.

This API copies all files in a directory to a single file (merges all the source files).
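A minimal sketch of using FileUtil.copyMerge() to produce the merged file inside HDFS itself (the paths are hypothetical; note that copyMerge() exists in Hadoop 2.x but was removed in Hadoop 3):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsMerge {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path srcDir = new Path("/user/alice/job-output");  // directory of part-* files
        Path dstFile = new Path("/user/alice/merged.txt"); // single merged file in HDFS

        // copyMerge(srcFS, srcDir, dstFS, dstFile, deleteSource, conf, addString)
        FileUtil.copyMerge(fs, srcDir, fs, dstFile,
                false,  // keep the source files after merging
                conf,
                null);  // no separator string between files
    }
}
```

Because both the source and destination FileSystem arguments are the same HDFS instance here, the merge never touches the local file system, which addresses the space concern raised in the question.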

Manjunath Ballur answered Oct 29 '22 23:10