Difference between hadoop fs -put and hadoop distcp

Tags:

hadoop

We are about to start the ingestion phase of our data lake project, and I have mostly used hadoop fs -put throughout my time as a Hadoop developer. How does hadoop distcp differ from it, and how does the usage differ?

760

asked Mar 30 '17 09:03

oikonomiyaki

1 Answer

DistCp is a tool for copying data from one cluster to another (it also works within a single cluster). You usually copy from one HDFS to another HDFS, not from the local file system. Another very important point is that the copy runs as a MapReduce job with 0 reduce tasks, which makes it faster for large data sets because the work is distributed across the cluster. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
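As a rough illustration of the usage, here is a minimal sketch of a cluster-to-cluster copy. The NameNode hosts, port, paths and the map count are all made-up placeholders, and the `distcp_copy` helper is just something I wrote for this example. It only prints the command unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Hypothetical source and destination clusters; replace with your own.
SRC="hdfs://nn1.example.com:8020/data/raw"
DST="hdfs://nn2.example.com:8020/data/raw"
MAPS=20   # -m: maximum number of simultaneous map tasks (copies)

# Build the distcp command. By default it is only printed;
# set DRY_RUN=0 to actually submit the MapReduce job.
# Assumes the paths contain no spaces.
distcp_copy() {
    # -update copies only files that are missing or differ at the destination
    cmd="hadoop distcp -m $MAPS -update $SRC $DST"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

distcp_copy
```

The -m value is only an upper bound on the number of map tasks, so for a handful of small files you will not see much benefit from raising it.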

hadoop fs -put copies data from the local file system to HDFS. It uses the HDFS client behind the scenes and does all the work sequentially from the machine where you run it, talking to the NameNode and DataNodes. It does not create a MapReduce job.
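For comparison, a minimal sketch of a local-to-HDFS upload with put. The local file and the HDFS directory are made-up placeholders, and `put_file` is a helper I wrote for this example. As above, it only prints the command unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Hypothetical local file and HDFS target directory; replace with your own.
LOCAL_FILE="/tmp/events.csv"
HDFS_DIR="/data/landing"

# Build the put command. By default it is only printed;
# set DRY_RUN=0 to actually upload the file.
# Assumes the paths contain no spaces.
put_file() {
    # -f overwrites the destination file if it already exists
    cmd="hadoop fs -put -f $LOCAL_FILE $HDFS_DIR/"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

put_file
```

Everything here runs in a single client process on the machine where you invoke it, so throughput is limited by that machine's disk and network.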

101

answered Sep 21 '22 18:09

Alex
