I have two HDFS setups and want to copy (not migrate or move) some tables from HDFS1 to HDFS2. How can I copy data from one HDFS to another? Is it possible with Sqoop or another command-line tool?

How to copy data from one HDFS to another HDFS?

2 Answers

DistCp (distributed copy) is a tool used for copying data between clusters. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

Usage: $ hadoop distcp <src> <dst>

Example: $ hadoop distcp hdfs://nn1:8020/file1 hdfs://nn2:8020/file2

This copies file1 from nn1 to nn2 under the name file2.

DistCp is currently the best tool for this job. Sqoop copies data between a relational database and HDFS in either direction, but it does not copy from one HDFS to another.

More info:

http://hadoop.apache.org/docs/r1.2.1/distcp.html
http://hadoop.apache.org/docs/r1.2.1/distcp2.html

Two versions are available; distcp2 offers better runtime performance than the original distcp.

answered Sep 23 '22 22:09

Avinash Reddy

Hadoop comes with a useful program called distcp for copying large amounts of data to and from Hadoop filesystems in parallel. The canonical use case for distcp is transferring data between two HDFS clusters. If the clusters are running identical versions of Hadoop, the hdfs scheme is appropriate.

$ hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

This copies the data in the /foo directory on namenode1 to the /bar directory on namenode2. If /bar does not exist, it is created. Multiple source paths can also be specified.
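As a rough sketch of the multiple-source form, the helper below only prints the command it would run, so it can be tried without a cluster. The function name distcp_multi_cmd and the subdirectories /foo/a and /foo/b are hypothetical; only the `hadoop distcp <src>... <dst>` argument order comes from DistCp itself.

```shell
#!/bin/sh
# Hypothetical helper: prints a DistCp command that copies several source
# paths into one destination. Dry run only; nothing is executed.
distcp_multi_cmd() {
    dst=$1          # first argument: destination URI
    shift           # remaining arguments: one or more source URIs
    printf 'hadoop distcp'
    for src in "$@"; do
        printf ' %s' "$src"
    done
    printf ' %s\n' "$dst"   # DistCp expects the destination last
}

# Illustrative paths (assumed, not from the original answer).
distcp_multi_cmd hdfs://namenode2/bar hdfs://namenode1/foo/a hdfs://namenode1/foo/b
```

Once the printed command looks right, it can be pasted into a shell on a machine with the Hadoop client installed.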

Like rsync, distcp by default skips files that already exist at the destination. The -overwrite option overwrites existing files in the destination directory, and the -update option copies only the files that have changed.

$ hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo
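The three behaviours described above (skip, update, overwrite) could be wrapped in a small helper along these lines. This is a sketch, not part of DistCp: the function name distcp_mode_cmd and the mode names are made up for illustration, and the helper only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: maps a copy mode to the matching DistCp flag and
# prints the resulting command. Dry run only; nothing is executed.
distcp_mode_cmd() {
    mode=$1
    src=$2
    dst=$3
    case "$mode" in
        skip)      flag='' ;;            # default: existing files are skipped
        update)    flag='-update' ;;     # copy only files that have changed
        overwrite) flag='-overwrite' ;;  # overwrite existing files
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    # ${flag:+$flag } adds the flag and a trailing space only when it is set.
    echo "hadoop distcp ${flag:+$flag }$src $dst"
}

distcp_mode_cmd update hdfs://namenode1/foo hdfs://namenode2/bar/foo
```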

distcp is implemented as a MapReduce job in which the copying is done by map tasks that run in parallel across the cluster. There are no reducers.
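Because the copying is done by map tasks, the degree of parallelism can be capped with DistCp's -m option (maximum number of simultaneous copies). The sketch below only prints such a command; the function name distcp_maps_cmd and the value 20 are arbitrary choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: prints a DistCp command with an upper bound on the
# number of map tasks. Dry run only; nothing is executed.
distcp_maps_cmd() {
    maps=$1   # maximum number of simultaneous copies (passed to -m)
    src=$2
    dst=$3
    echo "hadoop distcp -m $maps $src $dst"
}

# 20 is an illustrative value, not a recommendation.
distcp_maps_cmd 20 hdfs://namenode1/foo hdfs://namenode2/bar
```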

If you try to copy data between two HDFS clusters that are running different versions, the copy will fail, since the RPC systems are incompatible. In that case, use the read-only, HTTP-based HFTP filesystem to read from the source. The job then has to run on the destination cluster.

$ hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar

50070 is the default port of the namenode's embedded web server.
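The choice between the hdfs and hftp schemes can be summarized in a short script. This is a minimal sketch under the assumptions stated in the answer (hftp on port 50070 for mismatched versions); the function name distcp_source_uri and its yes/no argument are hypothetical, and the script only prints the command.

```shell
#!/bin/sh
# Hypothetical helper: builds the source URI for DistCp.
# same_version is "yes" when both clusters run the same Hadoop version.
distcp_source_uri() {
    same_version=$1
    host=$2
    path=$3
    if [ "$same_version" = yes ]; then
        echo "hdfs://$host$path"         # same version: plain hdfs scheme
    else
        echo "hftp://$host:50070$path"   # different versions: read-only HFTP
    fi
}

# Dry run: print the command to run on the destination cluster.
src=$(distcp_source_uri no namenode1 /foo)
echo "hadoop distcp $src hdfs://namenode2/bar"
```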

answered Sep 22 '22 22:09

Baban Gaigole


Tags: hadoop, hdfs, bigdata, sqoop
