Is it possible to use DistCp to copy only files that match a certain pattern? For example, from /foo I only want the *.log files.
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
You can use the cp command in Hadoop. This command is similar to the Linux cp command, and it is used for copying files from one directory to another directory within the HDFS file system.
2) distcp runs a MapReduce job behind the scenes, whereas the cp command just invokes the FileSystem copy command for every file. 3) If other jobs are already running, distcp might take longer, depending on the memory and resources those jobs consume. In this case cp would be better. 4) Also, distcp works between two clusters.
I realize this is an old thread. But I was interested in the answer to this question myself - and dk89 also asked again in 2013. So here we go:
distcp does not support wildcards. The closest you can do is to:
Find the files you want to copy (sources), filter them using grep, format them for hdfs using awk, and output the result to an "input-files" list:
hadoop dfs -lsr hdfs://localhost:9000/path/to/source/dir/ \
| grep -e webapp.log.3. | awk '{print "hdfs://localhost:9000/" $8}' > input-files.txt
Put the input-files list into hdfs
hadoop dfs -put input-files.txt .
Create the target dir
hadoop dfs -mkdir hdfs://localhost:9000/path/to/target/
Run distcp using the input-files list and specifying the target hdfs dir:
hadoop distcp -i -f input-files.txt hdfs://localhost:9000/path/to/target/
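The list-building step above can be wrapped in a small helper. This is only a sketch: the function name listing_to_uris is made up for illustration, and it assumes the recursive listing prints plain absolute paths in the 8th column (permissions, replicas, user, group, size, date, time, path) and that no path contains spaces.

```shell
#!/bin/sh
# listing_to_uris (hypothetical helper): read a recursive HDFS listing on
# stdin and print one full URI per matching regular file.
#   $1  URI prefix to prepend, e.g. hdfs://localhost:9000
#   $2  extended regular expression matched against the path column
listing_to_uris() {
    # Pass both values through the environment so awk does not reinterpret
    # backslashes in the pattern.
    PREFIX=$1 PATTERN=$2 awk '
        # Regular files have "-" as the first character of the permissions
        # column; directories ("d") are skipped.
        $1 ~ /^-/ && $8 ~ ENVIRON["PATTERN"] { print ENVIRON["PREFIX"] $8 }
    '
}

# Example use (assumes a reachable cluster at the address used above):
#   hadoop fs -ls -R hdfs://localhost:9000/path/to/source/dir/ \
#     | listing_to_uris hdfs://localhost:9000 '\.log$' > input-files.txt
```

If your Hadoop version prints full URIs in the listing when you pass one in, pass an empty prefix instead. The resulting input-files.txt is then uploaded and handed to distcp -f exactly as in the steps above.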
DistCp is in fact just a regular MapReduce job: you can use the same globbing syntax as you would for the input of a regular MapReduce job. Generally, you can just use foo/*.log and that should suffice. You can experiment with the hadoop fs -ls command here: if globbing works with fs -ls, then it will work with DistCp (well, almost, but the differences are too subtle to go into here).
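A minimal sketch of that check-then-copy approach, assuming your DistCp version accepts a glob as its source as described above. The function name copy_glob and the HADOOP override variable are made up for illustration. Setting HADOOP=echo prints the commands instead of running them.

```shell
#!/bin/sh
# copy_glob (hypothetical helper): list what a glob matches, then copy the
# matches with DistCp if the listing succeeded.
#   $1  source glob; quote it so the local shell does not expand it
#   $2  target directory
copy_glob() {
    # HADOOP is an assumed override for dry runs; it defaults to "hadoop".
    "${HADOOP:-hadoop}" fs -ls "$1" &&
        "${HADOOP:-hadoop}" distcp "$1" "$2"
}

# Example use (dry run):
#   HADOOP=echo
#   copy_glob '/foo/*.log' '/path/to/target/'
```

The single quotes around the glob matter: without them the local shell may try to expand *.log against the local file system before Hadoop ever sees it.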