
Hadoop DistCp using wildcards?

Tags:

hadoop

Is it possible to use DistCp to copy only files that match a certain pattern? For example. For /foo I only want *.log files.

zzztimbo asked Apr 18 '11 at 21:04



2 Answers

I realize this is an old thread. But I was interested in the answer to this question myself - and dk89 also asked again in 2013. So here we go:

distcp does not support wildcards. The closest you can do is to:

Find the files you want to copy (the sources), filter them using grep, format them as HDFS URIs using awk, and write the result to an "input-files" list:

hadoop dfs -lsr hdfs://localhost:9000/path/to/source/dir/ \
  | grep -e webapp.log.3. | awk '{print "hdfs://localhost:9000/" $8}' > input-files.txt

Put the input-files list into HDFS:

hadoop dfs -put input-files.txt  .

Create the target dir:

hadoop dfs -mkdir hdfs://localhost:9000/path/to/target/

Run distcp using the input-files list and specifying the target hdfs dir:

hadoop distcp -i -f input-files.txt hdfs://localhost:9000/path/to/target/  
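
For anyone who wants to reuse the filtering step, here is a minimal sketch of it as a shell function. The name build_source_list and its two arguments are hypothetical, not part of Hadoop. It assumes the listing has the usual eight whitespace-separated columns with the path last, and that paths contain no spaces. It skips directories and only prepends the filesystem prefix when the listing printed a bare path.

```shell
#!/bin/sh
# Hypothetical helper: read a recursive HDFS listing on stdin and print
# one source URI per line for every file whose path matches a regex.
#   $1  filesystem prefix, e.g. hdfs://localhost:9000 (no trailing slash)
#   $2  extended regular expression matched against the path
# Assumes the path is the last column and contains no whitespace.
build_source_list() {
    PREFIX="$1" PATTERN="$2" awk '
        # Keep only full listing rows that are not directories.
        NF >= 8 && $1 !~ /^d/ {
            path = $NF
            if (path ~ ENVIRON["PATTERN"]) {
                # Prepend the prefix only when the listing shows a bare path.
                if (path !~ /^[a-z0-9]+:\/\//) path = ENVIRON["PREFIX"] path
                print path
            }
        }'
}
```

With that in place, the first step would look something like this (untested against a live cluster):

hadoop dfs -lsr /path/to/source/dir/ | build_source_list hdfs://localhost:9000 'webapp\.log\.3' > input-files.txt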
WestCoastProjects answered Sep 18 '22 at 09:09


DistCp is in fact just a regular map-reduce job, so you can use the same globbing syntax as you would for the input of any other map-reduce job. Generally, you can just use foo/*.log and that should suffice. You can experiment with hadoop fs -ls first: if a glob works with fs -ls, then it will work with DistCp (well, almost, but the differences are too subtle to go into here).
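
A minimal sketch of that workflow, assuming the same localhost:9000 namenode and target directory used elsewhere in this thread (both are placeholders). The glob is kept in single quotes so that the local shell passes it to Hadoop untouched instead of expanding it against local files. The run helper is hypothetical and only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholder paths; adjust them to your cluster.
SRC_GLOB='hdfs://localhost:9000/foo/*.log'   # quoted so the local shell does not expand it
TARGET='hdfs://localhost:9000/path/to/target/'

# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Check what the glob matches first, then copy the same set with DistCp.
run hadoop fs -ls "$SRC_GLOB"
run hadoop distcp "$SRC_GLOB" "$TARGET"
```

Running it as is prints the two commands for review; running it with DRY_RUN=0 executes them.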

GreyCat answered Sep 20 '22 at 09:09