I am struggling to find a way to use S3DistCp in my AWS EMR Cluster.
Some old examples that show how to add s3distcp as an EMR step use the elastic-mapreduce command, which is no longer used. Other sources suggest using the s3-dist-cp command, which is not found on current EMR clusters. Even the official documentation (online and the 2016 EMR developer guide PDF) presents an example like this:
aws emr add-steps --cluster-id j-3GYXXXXXX9IOK --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,Args=["--s3Endpoint,s3-eu-west-1.amazonaws.com","--src,s3://mybucket/logs/j-3GYXXXXXX9IOJ/node/","--dest,hdfs:///output","--srcPattern,.*[a-zA-Z,]+"]
But there is no lib folder in /home/hadoop. I found some Hadoop libraries in /usr/lib/hadoop/lib, but I cannot find s3distcp anywhere.
Then I found that there are some libraries available in some S3 buckets. For example, from this question, I found the path s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar. This seemed to be a step in the right direction: adding a new step to a running EMR cluster from the AWS interface with these parameters actually started the step (which previous attempts did not), but it failed after ~15 seconds:
JAR location: s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar
Main class: None
Arguments: --s3Endpoint s3-eu-west-1.amazonaws.com --src s3://source-bucket/scripts/ --dest hdfs:///output
Action on failure: Continue
This resulted in the following error:
Exception in thread "main" java.lang.RuntimeException: Unable to retrieve Hadoop configuration for key fs.s3n.awsAccessKeyId
at com.amazon.external.elasticmapreduce.s3distcp.ConfigurationCredentials.getConfigOrThrow(ConfigurationCredentials.java:29)
at com.amazon.external.elasticmapreduce.s3distcp.ConfigurationCredentials.<init>(ConfigurationCredentials.java:35)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.createInputFileListS3(S3DistCp.java:85)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.createInputFileList(S3DistCp.java:60)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:529)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
I thought this might have been caused by a mismatch between my S3 location (the same region as the endpoint) and the location of the s3distcp script, which was in us-east. I replaced it with eu-west-1 and still got the same authentication error. I have used a similar setup to run my Scala scripts (a Custom JAR step with "command-runner.jar" and "spark-submit" as the first argument to run a Spark job), and that works; I have not had this authentication problem before.
What is the simplest way to copy a file from S3 to an EMR cluster? Either by adding an additional EMR step with the AWS SDK (for Go), or somehow inside the Scala Spark script? Or from the AWS EMR interface, but not from the CLI, as I need it to be automated.
You can use aws s3 cp to copy files from an S3 bucket to your local file system. Use the following command: $ aws s3 cp s3://bucket/folder/file.txt .
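If you need a whole folder rather than a single file, the same command should also work with the --recursive flag (the bucket and destination path below are just placeholders):
$ aws s3 cp s3://bucket/folder/ /home/hadoop/input/ --recursive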
In the Amazon S3 console, choose your S3 bucket, choose the file that you want to open or download, choose Actions, and then choose Open or Download. If you are downloading an object, specify where you want to save it.
The CLI that comes installed on EMR is aws <servicename> <function>:
aws s3 cp s3://bucket/path/to/remote/file.sh /local/path/to/file.sh
https://aws.amazon.com/cli/
As for automating that, it's certainly reasonable to put your commands into a custom step where the "path" to the command is simply "command-runner.jar" and the args of the step are the command itself.
So, ultimately, CLI code can do the same thing:
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Name="Command Runner",Jar="command-runner.jar",Args=["spark-submit","Args..."]
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html
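Applied to your copy, a step along these lines should do it (the cluster ID, bucket, and paths are placeholders; command-runner.jar just executes whatever command is given in Args on the master node):
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=CUSTOM_JAR,Name="Copy from S3",ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=["aws","s3","cp","s3://bucket/path/to/remote/file.sh","/local/path/to/file.sh"]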
Thanks to the previous answers. I was stuck, but was able to build this to use s3-dist-cp to copy from S3 to EMR:
aws emr add-steps --profile <> --cluster-id <> --steps Type=CUSTOM_JAR,Name=UPLOAD_JAR_CONFIG,ActionOnFailure=CANCEL_AND_WAIT,Jar=command-runner.jar,Args=[s3-dist-cp,--src,s3a://<>/,--dest,hdfs:///<>/<>/,--srcPattern=.*.*]
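For a quick manual check, the same tool can also be run directly on the master node over SSH, assuming your EMR release (4.x or later) ships the s3-dist-cp command on the PATH (bucket and destination are placeholders):
s3-dist-cp --src s3://bucket/logs/ --dest hdfs:///output/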