Difference between 'distcp' and 'distcp -update'?

What is the difference between

hadoop distcp

and

hadoop distcp -update

Both seem to do the same work, with only a slight difference in how we call them. Neither of them overwrites an already existing file in the destination. What is the point, then, of having two different commands?

Harsh asked Jan 06 '11 01:01



1 Answer

The difference is that plain distcp skips any file that already exists at the destination, while distcp -update replaces a destination file whose size differs from the source file's size.

The documentation is a bit confusing on this point, since distcp's default behavior is to skip a file that already exists, to prevent collisions.

From the docs:

"As noted in the preceding, this is not a "sync" operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file."

Keep in mind that -update is not a delta-transfer algorithm like rsync. It only does a size check, which isn't reliable when two files are the same size yet contain different data.
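To make the size-check caveat concrete, here is a small local-filesystem analogy. It does not touch HDFS or call distcp at all; the file names are made up for illustration, and it only shows that two files can pass a size comparison while holding different bytes, which is the case a size-only check (as described here; later Hadoop releases may behave differently) would miss.

```shell
# Local stand-ins for a source and a destination file (hypothetical names).
workdir=$(mktemp -d)
printf 'aaaa' > "$workdir/src.dat"
printf 'bbbb' > "$workdir/dst.dat"

# Size comparison: the only criterion the quoted documentation mentions.
src_size=$(wc -c < "$workdir/src.dat" | tr -d ' ')
dst_size=$(wc -c < "$workdir/dst.dat" | tr -d ' ')
if [ "$src_size" -eq "$dst_size" ]; then
    size_verdict="same size"
else
    size_verdict="different size"
fi

# Byte-for-byte comparison: what a size check cannot tell you.
if cmp -s "$workdir/src.dat" "$workdir/dst.dat"; then
    content_verdict="same content"
else
    content_verdict="different content"
fi

echo "$size_verdict, $content_verdict"   # prints: same size, different content
rm -rf "$workdir"
```

A file pair like this would be left alone by a size-only update, even though the destination is stale.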

I should also add that distcp -overwrite will overwrite the file whether or not the sizes match. It's a destructive operation, so make sure that you really want to do this.
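The three behaviours can be summarised as a per-file decision. The function below is a hypothetical helper, not part of Hadoop and not how distcp is implemented; it is only a sketch of the rules as this answer states them, taking the mode, whether the destination file exists, and the two sizes as plain arguments.

```shell
# Hypothetical helper, not part of Hadoop: models the per-file decision
# described in this answer.
# Usage: distcp_decision MODE DST_EXISTS SRC_SIZE DST_SIZE
#   MODE        default | update | overwrite
#   DST_EXISTS  yes | no
distcp_decision() {
    mode=$1 dst_exists=$2 src_size=$3 dst_size=$4

    # A file missing at the destination is copied in every mode.
    if [ "$dst_exists" = "no" ]; then
        echo copy
        return
    fi

    case "$mode" in
        default)
            echo skip ;;                 # existing files are left alone
        update)
            if [ "$src_size" -ne "$dst_size" ]; then
                echo replace             # sizes differ: source wins
            else
                echo skip                # same size: assumed up to date
            fi ;;
        overwrite)
            echo replace ;;              # replaced regardless of size
        *)
            echo "unknown mode: $mode" >&2
            return 1 ;;
    esac
}

distcp_decision default   yes 100 200   # skip
distcp_decision update    yes 100 200   # replace
distcp_decision update    yes 100 100   # skip
distcp_decision overwrite yes 100 100   # replace
```

Laid out this way, the only case where default and -update disagree is an existing destination file of a different size.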

Some great examples can be found here: http://hadoop.apache.org/common/docs/r0.19.2/distcp.html#uo

I also want to give an example of what I do in a sync operation between two clusters:

hadoop distcp -pugp -i -delete -update hftp://hdfs-nn1:50070/clustera hdfs://hdfs-nn2:9000/clustera

This will update every file on hdfs-nn2 whose size doesn't match its counterpart on hdfs-nn1, and delete any extraneous files. If you are using .Trash, any deleted files are placed in the Trash of the user invoking distcp.

I would experiment with it a bit so you can see the effect of the various options. It can be painful to accidentally wipe out TBs of data, so definitely use your Trash.

SysEngAtl answered Nov 11 '22 11:11
