Using s3distcp with Amazon EMR to copy a single file

Question

I want to copy just a single file to HDFS using s3distcp. I have tried using the srcPattern argument but it didn't help and it keeps on throwing java.lang.Runtime exception. It is possible that the regex I am using is the culprit, please help.

My code is as follows:

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>' --args '--dest,hdfs:///output' --arg --srcPattern --arg '(filename)'

Exception thrown:

Exception in thread "main" java.lang.RuntimeException: Error running job at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:586) at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:/tmp/a088f00d-a67e-4239-bb0d-32b3a6ef0105/files at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197) at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1036) at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1028) at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:172) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:897) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:871) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1308) at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:568) ... 9 more

prestomation · Accepted Answer

DistCp is intended to copy many files using many machines. DistCp is not the right tool if you want to only copy one file.

On the hadoop master node, you can copy a single file using

hadoop fs -cp s3://<mybucket>/<path> hdfs:///output

Amar · Answer

The regex I was using was indeed the culprit. Say the file names have dates, for example files are like abcd-2013-06-12.gz , then in order to copy ONLY this file, following emr command should do:

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3:///' --args '--dest,hdfs:///output' --arg --srcPattern --arg '.*2013-06-12.gz'

If I remember correctly, my regex initially was *2013-06-12.gz and not .*2013-06-12.gz. So the dot at the beginning was needed.

Using s3distcp with Amazon EMR to copy a single file

Tags:

amazon-s3

hadoop

mapreduce

emr

elastic-map-reduce

Amar

2 Answers

prestomation

Amar

Recent Activity

Donate For Us

Using s3distcp with Amazon EMR to copy a single file

Tags:

amazon-s3

hadoop

mapreduce

emr

elastic-map-reduce

Amar

2 Answers

prestomation

Amar

Related questions

Recent Activity

Donate For Us