What is the difference between
hadoop distcp
and
hadoop distcp -update
Both seem to do the same work, with only a slight difference in how they are called. Neither overwrites a file that already exists at the destination. What is the point, then, of having two different commands?
The difference is that plain distcp skips any file that already exists at the destination, while "distcp -update" replaces a destination file when its size differs from the source file's size.
The documentation is a bit confusing on this point, since the default behavior of distcp is to skip a file that already exists, to avoid collisions.
From the docs:
"As noted in the preceding, this is not a "sync" operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file."
Keep in mind that -update is not a delta-transfer algorithm like rsync. It only checks file size, so it will miss a file whose contents have changed while its size stayed the same.
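To make the blind spot concrete, here is a small local sketch that involves no Hadoop at all. The helper names size_of and would_update and the file names are made up for illustration; the script only mimics a size-only comparison on ordinary files and is not how distcp is implemented.

```shell
#!/bin/sh
# Local stand-in for a size-only comparison (hypothetical helpers, no Hadoop).

# size_of FILE: print the size of FILE in bytes
size_of() {
    wc -c < "$1" | tr -d ' '
}

# would_update SRC DST: print "copy" if the sizes differ, "skip" otherwise
would_update() {
    if [ "$(size_of "$1")" -ne "$(size_of "$2")" ]; then
        echo copy
    else
        echo skip
    fi
}

workdir=$(mktemp -d)
printf 'AAAA' > "$workdir/src.dat"   # 4 bytes
printf 'BBBB' > "$workdir/dst.dat"   # 4 bytes, different content

would_update "$workdir/src.dat" "$workdir/dst.dat"   # sizes match, so "skip"
cmp -s "$workdir/src.dat" "$workdir/dst.dat" || echo "contents differ"
rm -rf "$workdir"
```

The size test reports "skip" even though a byte-by-byte comparison shows the two files differ, which is the case a size-only update would leave stale.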
I should also add that distcp -overwrite will overwrite the file whether or not the sizes match. It is a destructive operation, so make sure that you really want to do this.
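The three behaviors described so far can be summarized as a small decision rule. The sketch below is only a simplified model of what this answer describes: the function name distcp_action and its arguments are hypothetical, and it does not call or reproduce distcp itself.

```shell
#!/bin/sh
# distcp_action MODE SRC_SIZE [DST_SIZE]
# MODE is default, update, or overwrite. Leave DST_SIZE out when the
# destination file does not exist. Prints "copy" or "skip".
# Hypothetical helper that models the rules in this answer only.
distcp_action() {
    mode=$1
    src_size=$2
    dst_size=${3:-}
    if [ -z "$dst_size" ]; then
        echo copy                # nothing at the destination yet
        return
    fi
    case $mode in
        overwrite)
            echo copy ;;         # always replace, sizes are ignored
        update)
            if [ "$src_size" -ne "$dst_size" ]; then
                echo copy        # sizes differ, source replaces destination
            else
                echo skip        # same size, assumed unchanged
            fi ;;
        *)
            echo skip ;;         # default: existing files are left alone
    esac
}

distcp_action default   100 200   # skip
distcp_action update    100 200   # copy
distcp_action update    100 100   # skip
distcp_action overwrite 100 100   # copy
distcp_action default   100       # copy
```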
Some great examples can be found here: http://hadoop.apache.org/common/docs/r0.19.2/distcp.html#uo
I also want to give an example of what I do in a sync operation between two clusters:
hadoop distcp -pugp -i -delete -update hftp://hdfs-nn1:50070/clustera hdfs://hdfs-nn2:9000/clustera
This updates every file on hdfs-nn2 whose size does not match the copy on hdfs-nn1, and deletes any files on hdfs-nn2 that do not exist on hdfs-nn1. The -pugp option preserves user, group and permissions, and -i ignores failures. If Trash is enabled, deleted files are moved to the Trash of the user who invoked distcp.
I would experiment with it a bit so you can see the effect of the various options. Accidentally wiping out terabytes of data is painful, so definitely use your Trash.