
Use S3DistCp to copy a file from S3 to EMR

I am struggling to find a way to use S3DistCp in my AWS EMR cluster.

Some old examples that show how to add s3distcp as an EMR step use the elastic-mapreduce command, which is no longer used.

Other sources suggest using the s3-dist-cp command, which is not found on current EMR clusters. Even the official documentation (online and the 2016 EMR developer guide PDF) presents an example like this:

aws emr add-steps --cluster-id j-3GYXXXXXX9IOK --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,Args=["--s3Endpoint,s3-eu-west-1.amazonaws.com","--src,s3://mybucket/logs/j-3GYXXXXXX9IOJ/node/","--dest,hdfs:///output","--srcPattern,.*[a-zA-Z,]+"]

But there is no lib folder under /home/hadoop. I found some Hadoop libraries in /usr/lib/hadoop/lib, but I cannot find s3distcp anywhere.

Then I found that some libraries are available in certain S3 buckets. For example, from another question, I found this path: s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar. This seemed to be a step in the right direction: adding a new step to a running EMR cluster from the AWS interface with these parameters actually started the step (which previous attempts didn't), but it failed after ~15 seconds:

JAR location: s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar
Main class: None
Arguments: --s3Endpoint s3-eu-west-1.amazonaws.com --src s3://source-bucket/scripts/ --dest hdfs:///output
Action on failure: Continue

This resulted in the following error:

Exception in thread "main" java.lang.RuntimeException: Unable to retrieve Hadoop configuration for key fs.s3n.awsAccessKeyId
    at com.amazon.external.elasticmapreduce.s3distcp.ConfigurationCredentials.getConfigOrThrow(ConfigurationCredentials.java:29)
    at com.amazon.external.elasticmapreduce.s3distcp.ConfigurationCredentials.<init>(ConfigurationCredentials.java:35)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.createInputFileListS3(S3DistCp.java:85)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.createInputFileList(S3DistCp.java:60)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:529)
    at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

I thought this might have been caused by a mismatch between my S3 location (the same as the endpoint) and the location of the s3distcp script, which was in us-east. I replaced it with eu-west-1 and still got the same authentication error. I have used a similar setup to run my Scala scripts (a CUSTOM_JAR step with the "command-runner.jar" jar and "spark-submit" as the first argument to run a Spark job), and that works; I have not had this authentication problem before.

What is the simplest way to copy a file from S3 to an EMR cluster? Either by adding an additional EMR step with the AWS SDK (for Go), or somehow from inside the Scala Spark script? Or from the AWS EMR interface, but not from the CLI, as I need it to be automated.

asked Sep 08 '16 by V. Samma




2 Answers

The AWS CLI comes preinstalled on EMR; commands take the form aws <servicename> <function>:


aws s3 cp s3://bucket/path/to/remote/file.sh /local/path/to/file.sh

https://aws.amazon.com/cli/

As far as automating that goes, it's certainly reasonable to put your commands into a custom step, where the "path" to the command is simply "command-runner.jar" and the step's args make up the command itself.

So, ultimately, the CLI can do the same thing:

aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Name="Command Runner",Jar="command-runner.jar",Args=["spark-submit","Args..."]

http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html
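
Since the question asks about automating this with the AWS SDK for Go, here is a minimal sketch of adding such a command-runner.jar step programmatically with the aws-sdk-go (v1) EMR client. The region, cluster ID, bucket, and file paths below are placeholders, not values from the question:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/emr"
)

func main() {
	// The session picks up credentials from the environment or shared config.
	sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("eu-west-1")))
	svc := emr.New(sess)

	// Add a step that runs "aws s3 cp" on the master node via command-runner.jar.
	out, err := svc.AddJobFlowSteps(&emr.AddJobFlowStepsInput{
		JobFlowId: aws.String("j-XXXXXXXXXXXXX"), // placeholder cluster ID
		Steps: []*emr.StepConfig{
			{
				Name:            aws.String("Copy file from S3"),
				ActionOnFailure: aws.String("CONTINUE"),
				HadoopJarStep: &emr.HadoopJarStepConfig{
					Jar: aws.String("command-runner.jar"),
					Args: aws.StringSlice([]string{
						"aws", "s3", "cp",
						"s3://my-bucket/scripts/file.sh", // placeholder source
						"/home/hadoop/file.sh",           // placeholder destination
					}),
				},
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("step IDs:", aws.StringValueSlice(out.StepIds))
}

This is the programmatic equivalent of the aws emr add-steps command above: the step simply runs the same aws s3 cp command on the cluster, so no extra jar or library needs to be present.
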

answered Nov 05 '22 by Kristian


Thanks for the previous answers. I was stuck, but was able to build this step, which uses s3-dist-cp (run through command-runner.jar) to copy from S3 to EMR:

aws emr add-steps --profile <> --cluster-id <> --steps Type=CUSTOM_JAR,Name=UPLOAD_JAR_CONFIG,ActionOnFailure=CANCEL_AND_WAIT,Jar=command-runner.jar,Args=[s3-dist-cp,--src,s3a://<>/,--dest,hdfs:///<>/<>/,--srcPattern=.*.*]
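
If you want to add this step from Go rather than the CLI, the AddJobFlowSteps sketch in the first answer applies unchanged; only the step's Args differ. A hypothetical variant (the bucket, destination path, and pattern are placeholders):

package steps

import "github.com/aws/aws-sdk-go/aws"

// Args for an s3-dist-cp step like the one above; swap these into the
// HadoopJarStepConfig of the earlier AddJobFlowSteps sketch.
var s3DistCpArgs = aws.StringSlice([]string{
	"s3-dist-cp",
	"--src", "s3a://my-source-bucket/", // placeholder source bucket
	"--dest", "hdfs:///my/dest/",       // placeholder HDFS destination
	"--srcPattern", ".*",               // placeholder pattern
})
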

answered Nov 05 '22 by supermonk