Difference between hadoop fs -put and hadoop distcp

Tags:

hadoop

We are about to start the ingestion phase of our data lake project, and I have mostly used hadoop fs -put throughout my time as a Hadoop developer. How does it differ from hadoop distcp, and how does the usage differ?

oikonomiyaki asked Mar 30 '17 09:03



People also ask

What is Hadoop DistCp?

DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

What is difference between CP and DistCp?

2) distcp runs a MapReduce job behind the scenes, whereas the cp command just invokes the FileSystem copy command for every file. 3) If other jobs are already running, distcp might take longer, depending on the memory and resources those jobs consume. In this case cp would be better. 4) Also, distcp works between two clusters.

What is in Hadoop FS?

The Hadoop Distributed File System (HDFS) is the primary data storage system used by Hadoop applications. HDFS employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.

Does DistCp overwrite?

The DistCp -overwrite option overwrites files at the target even if they already exist there or have the same contents as the source. The -update and -overwrite options warrant further discussion, since their handling of source paths varies from the defaults in a very subtle manner.


1 Answer

DistCp (distributed copy) is a tool for copying large amounts of data from one cluster to another, or within a single cluster. You normally copy from one HDFS to another HDFS; it is not meant for uploading files from the local file system. Another very important point is that the copy is done as a MapReduce job with zero reduce tasks, so the work is distributed across the cluster, which usually makes it faster for large data sets. It expands a list of files and directories into input to map tasks, each of which copies a partition of the files specified in the source list.
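To make the DistCp description concrete, here is a minimal sketch of an inter-cluster copy. The NameNode addresses, the paths, and the map count of 20 are made-up values for illustration, and the run helper is a hypothetical dry-run wrapper (not part of Hadoop) that only prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set,
# so the sketch can be read and checked without a Hadoop installation.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Assumed NameNode addresses and paths; replace with real ones.
SRC="hdfs://nn1.example.com:8020/data/raw"
DST="hdfs://nn2.example.com:8020/data/raw"

# Copy a directory tree between clusters as a map-only MapReduce job.
# -m caps the number of map tasks (20 is an arbitrary value).
run hadoop distcp -m 20 "$SRC" "$DST"

# On later runs, -update copies only files that are missing at the target
# or differ from the target version.
run hadoop distcp -update -m 20 "$SRC" "$DST"
```

Because the job needs free containers on the cluster, a busy cluster can delay it, which is one reason small copies are often done without DistCp.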

hadoop fs -put copies data from the local file system to HDFS. It uses the HDFS client behind the scenes and does all the work sequentially from the machine that runs the command, talking to the NameNode and DataNodes. It does not create a MapReduce job.
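For comparison, a local-to-HDFS upload with put might look like the sketch below. The local file name and the HDFS landing directory are assumptions chosen for illustration, and the run helper is again a hypothetical dry-run wrapper that only prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Assumed local file and HDFS landing directory; replace with real ones.
LOCAL_FILE="/var/data/events.csv"
HDFS_DIR="/datalake/landing/events"

# Create the target directory if it does not exist yet.
run hadoop fs -mkdir -p "$HDFS_DIR"

# Upload the local file through the HDFS client on this machine.
run hadoop fs -put "$LOCAL_FILE" "$HDFS_DIR/"

# -f overwrites the destination file if it already exists.
run hadoop fs -put -f "$LOCAL_FILE" "$HDFS_DIR/"
```

For an ingestion phase, this suggests a rough rule of thumb: put when the data starts on the local disk of an edge node, distcp when the data already sits on a cluster and is large enough to benefit from a parallel copy.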

Alex answered Sep 21 '22 18:09
