Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using s3distcp with Amazon EMR to copy a single file

I want to copy just a single file to HDFS using s3distcp. I have tried using the srcPattern argument but it didn't help and it keeps on throwing java.lang.Runtime exception. It is possible that the regex I am using is the culprit, please help.

My code is as follows:

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3://<mybucket>/<path>' --args '--dest,hdfs:///output' --arg --srcPattern --arg '(filename)'

Exception thrown:

Exception in thread "main" java.lang.RuntimeException: Error running job at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:586) at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:/tmp/a088f00d-a67e-4239-bb0d-32b3a6ef0105/files at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197) at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208) at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1036) at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1028) at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:172) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:897) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:871) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1308) at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:568) ... 9 more
like image 300
Amar Avatar asked Nov 21 '12 13:11

Amar


2 Answers

DistCp is intended to copy many files using many machines. DistCp is not the right tool if you want to only copy one file.

On the hadoop master node, you can copy a single file using

hadoop fs -cp s3://<mybucket>/<path> hdfs:///output

like image 195
prestomation Avatar answered Nov 15 '22 04:11

prestomation


The regex I was using was indeed the culprit. Say the file names have dates, for example files are like abcd-2013-06-12.gz , then in order to copy ONLY this file, following emr command should do:

elastic-mapreduce -j $jobflow --jar s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar --args '--src,s3:///' --args '--dest,hdfs:///output' --arg --srcPattern --arg '.*2013-06-12.gz'

If I remember correctly, my regex initially was *2013-06-12.gz and not .*2013-06-12.gz. So the dot at the beginning was needed.

like image 32
Amar Avatar answered Nov 15 '22 06:11

Amar