Hadoop DistCp handle same file name by renaming

Is there any way to run DistCp, but with an option to rename on file name collisions? Maybe it's easiest to explain with an example.

Let's say I'm copying from hdfs:///foo to hdfs:///bar, and foo contains these files:

hdfs:///foo/a
hdfs:///foo/b
hdfs:///foo/c

and bar contains these:

hdfs:///bar/a
hdfs:///bar/b

Then after the copy, I'd like bar to contain something like:

hdfs:///bar/a
hdfs:///bar/a-copy1
hdfs:///bar/b
hdfs:///bar/b-copy1
hdfs:///bar/c

If there is no such option, what might be the most reliable/efficient way to do this? My own home-grown version of distcp could certainly get it done, but that seems like it could be a lot of work and pretty error-prone. Basically, I don't care at all about the file names, just their directory, and I want to periodically copy large amounts of data into a "consolidation" directory.

Joe K asked Nov 11 '22 08:11

1 Answer

DistCp does not have that option. If you are doing the copy through the Java API, you can handle collisions yourself: before writing each file, check whether the destination path already exists, and choose a different path if it does. A FileSystem object provides this check through its exists(Path p) method.
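To illustrate the check-then-rename idea, here is a minimal sketch of the name-picking step only. The class name CollisionRename, the method resolveTarget, and the "-copyN" suffix scheme are hypothetical, chosen to match the example in the question. The existence check is passed in as a Predicate so the sketch runs without a cluster; with Hadoop you would presumably supply a predicate that wraps FileSystem.exists(Path) and handles its IOException.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

public class CollisionRename {

    /**
     * Returns dest unchanged if nothing exists there; otherwise returns the
     * first free name of the form dest-copy1, dest-copy2, ...
     * The "exists" predicate stands in for a real existence check such as
     * FileSystem.exists(Path) (assumption: the caller wraps it suitably).
     */
    public static String resolveTarget(String dest, Predicate<String> exists) {
        if (!exists.test(dest)) {
            return dest;
        }
        int n = 1;
        while (exists.test(dest + "-copy" + n)) {
            n++;
        }
        return dest + "-copy" + n;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the contents of hdfs:///bar.
        Set<String> bar = new HashSet<>();
        bar.add("hdfs:///bar/a");
        bar.add("hdfs:///bar/b");

        for (String name : new String[] {"a", "b", "c"}) {
            String target = resolveTarget("hdfs:///bar/" + name, bar::contains);
            bar.add(target); // record the "copied" file so later checks see it
            System.out.println(target);
        }
        // prints:
        // hdfs:///bar/a-copy1
        // hdfs:///bar/b-copy1
        // hdfs:///bar/c
    }
}
```

Note that checking for existence and then writing is not atomic, so if several jobs copy into the same directory at once, two of them could still pick the same name. For a periodic single-writer consolidation job this may be acceptable.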

Leonardo Neves answered Nov 15 '22 09:11
