I have two HDFS clusters and want to copy (not migrate or move) some tables from HDFS1 to HDFS2. How can I copy data from one HDFS to another? Is this possible with Sqoop or another command-line tool?
You can use the cp command in Hadoop (hadoop fs -cp). It is similar to the Unix cp command and copies files from one directory to another within a single HDFS file system.
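Copying a file or directory: the simplest way to copy a file to or from a cluster is to use the scp command, for example scp cecicluster:path/to/file.txt . (the trailing dot is the local destination). If you want to copy a directory and its contents, use the -r option, just like with cp. Note that scp works on the hosts' local file systems, not on HDFS.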
DistCp (distributed copy) is a tool used for copying data between clusters. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
Usage: $ hadoop distcp <src> <dst>
Example: $ hadoop distcp hdfs://nn1:8020/file1 hdfs://nn2:8020/file2
Here file1 from nn1 is copied to nn2 with the filename file2.
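As a rough sketch of how the usage above could be wrapped for copying a table directory, the helper below only prints the distcp command so it can be reviewed before running. The function name build_distcp_cmd and the /warehouse/mytable path are made up for illustration; nn1:8020 and nn2:8020 are the placeholder NameNode addresses from the example.

```shell
#!/bin/sh
# Sketch only: builds (does not run) a distcp command between two clusters.
# build_distcp_cmd and /warehouse/mytable are hypothetical names.

# build_distcp_cmd SRC_NAMENODE DST_NAMENODE SRC_PATH DST_PATH
build_distcp_cmd() {
    src_nn=$1
    dst_nn=$2
    src_path=$3
    dst_path=$4
    # Paths are expected to start with "/", so they follow the host:port directly.
    printf 'hadoop distcp hdfs://%s%s hdfs://%s%s\n' \
        "$src_nn" "$src_path" "$dst_nn" "$dst_path"
}

# Print the command for copying one table directory to the same path on nn2.
build_distcp_cmd nn1:8020 nn2:8020 /warehouse/mytable /warehouse/mytable
```

Once the printed line looks right, it can be pasted into a shell on a node that can reach both clusters.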
DistCp is the best tool for this as of now. Sqoop copies data between a relational database and HDFS, in either direction, but not from one HDFS to another.
More info:
There are two versions available: distcp and distcp2. distcp2 has better runtime performance than distcp.
Hadoop comes with a useful program called distcp for copying large amounts of data to and from Hadoop filesystems in parallel. The canonical use case for distcp is transferring data between two HDFS clusters. If the clusters are running identical versions of Hadoop, then the hdfs scheme is appropriate to use.
$ hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
The data in the /foo directory of namenode1 will be copied to the /bar directory of namenode2. If the /bar directory does not exist, distcp creates it. Multiple source paths can also be given.
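Like the rsync command, distcp by default skips files that already exist at the destination. The -overwrite option overwrites existing files in the destination directory, and the -update option copies only the files that have changed. With -update or -overwrite, the contents of the source directory are copied into the destination directory rather than the source directory itself, which is why the following command targets /bar/foo.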
$ hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo
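To make the three behaviours easier to compare, here is a minimal sketch that maps a mode name to the matching option and prints the resulting command. The mode names (skip, update, overwrite) and both function names are invented for this example; only -update and -overwrite are actual distcp options mentioned above.

```shell
#!/bin/sh
# Sketch only: prints a distcp command for a chosen copy mode, does not run it.
# distcp_mode_flag and build_sync_cmd are hypothetical helper names.

# distcp_mode_flag MODE -> prints the distcp option for that mode (empty for skip).
distcp_mode_flag() {
    case "$1" in
        skip)      printf '' ;;                 # default: existing files are skipped
        update)    printf '%s' '-update' ;;     # copy only files that changed
        overwrite) printf '%s' '-overwrite' ;;  # replace existing files
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# build_sync_cmd MODE SRC_URI DST_URI
build_sync_cmd() {
    mode=$1
    src=$2
    dst=$3
    flag=$(distcp_mode_flag "$mode") || return 1
    # ${flag:+...} emits the flag plus a space only when the flag is non-empty.
    printf 'hadoop distcp %s%s %s\n' "${flag:+$flag }" "$src" "$dst"
}

build_sync_cmd update hdfs://namenode1/foo hdfs://namenode2/bar/foo
```

Keeping the destination as an explicit argument matters here, because the right target path differs between the default mode and -update or -overwrite.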
distcp is implemented as a MapReduce job in which the copying is done by maps that run in parallel across the cluster. There are no reducers.
If you try to copy data between two HDFS clusters that are running different versions, the copy will fail, since the RPC systems are incompatible. In that case, use the read-only, HTTP-based HFTP filesystem to read from the source. The job then has to run on the destination cluster.
$ hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar
50070 is the default port number for namenode's embedded web server.
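The cross-version case can be sketched the same way. The helper below (a hypothetical name, build_cross_version_cmd) prints a distcp command whose source is read over HTTP. As far as I know, HFTP was removed in Hadoop 3 and webhdfs is the usual replacement there, with 9870 as the default NameNode web port, so the scheme and port are left as parameters; check both against the versions you actually run.

```shell
#!/bin/sh
# Sketch only: prints a cross-version distcp command, does not run it.
# build_cross_version_cmd is a hypothetical helper name.

# build_cross_version_cmd SCHEME SRC_HOST HTTP_PORT SRC_PATH DST_URI
build_cross_version_cmd() {
    scheme=$1
    case "$scheme" in
        hftp|webhdfs) ;;   # HTTP-based schemes for reading from the source
        *) echo "unsupported scheme: $scheme" >&2; return 1 ;;
    esac
    printf 'hadoop distcp %s://%s:%s%s %s\n' "$scheme" "$2" "$3" "$4" "$5"
}

# Same command as the example above, built from its parts.
build_cross_version_cmd hftp namenode1 50070 /foo hdfs://namenode2/bar
```

The printed command is meant to be run on the destination cluster, as described above.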