I am struggling to find a way to use S3DistCp in my AWS EMR Cluster.
Some old examples that show how to add s3distcp as an EMR step use the elastic-mapreduce command, which is no longer used. Other sources suggest using the s3-dist-cp command, which is not found on current EMR clusters. Even the official documentation (online and the 2016 EMR developer guide PDF) presents an example like this:
aws emr add-steps --cluster-id j-3GYXXXXXX9IOK --steps Type=CUSTOM_JAR,Name="S3DistCp step",Jar=/home/hadoop/lib/emr-s3distcp-1.0.jar,Args=["--s3Endpoint,s3-eu-west-1.amazonaws.com","--src,s3://mybucket/logs/j-3GYXXXXXX9IOJ/node/","--dest,hdfs:///output","--srcPattern,.*[a-zA-Z,]+"]
But there is no lib folder in /home/hadoop. I found some Hadoop libraries in /usr/lib/hadoop/lib, but I cannot find s3distcp anywhere.
Then I found that there are some libraries available in some S3 buckets. For example, from this question, I found the path s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar. This seemed to be a step in the right direction: adding a new step to a running EMR cluster from the AWS interface with these parameters actually started the step (which previous attempts did not), but it failed after ~15 seconds:
JAR location: s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar
Main class: None
Arguments: --s3Endpoint s3-eu-west-1.amazonaws.com --src s3://source-bucket/scripts/ --dest hdfs:///output
Action on failure: Continue
This resulted in the following error:
Exception in thread "main" java.lang.RuntimeException: Unable to retrieve Hadoop configuration for key fs.s3n.awsAccessKeyId
at com.amazon.external.elasticmapreduce.s3distcp.ConfigurationCredentials.getConfigOrThrow(ConfigurationCredentials.java:29)
at com.amazon.external.elasticmapreduce.s3distcp.ConfigurationCredentials.<init>(ConfigurationCredentials.java:35)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.createInputFileListS3(S3DistCp.java:85)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.createInputFileList(S3DistCp.java:60)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:529)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
I thought this might have been caused by a mismatch between my S3 location (the same region as the endpoint) and the location of the s3distcp script, which was in us-east. I replaced it with eu-west-1 and still got the same authentication error. I have used a similar setup to run my Scala scripts (a Custom JAR step with "command-runner.jar" and "spark-submit" as the first argument to run a Spark job), and that works; I have not had this authentication problem before.
What is the simplest way to copy a file from S3 to an EMR cluster? Either by adding an additional EMR step with the AWS SDK (for Go), or somehow inside the Scala Spark script? Or from the AWS EMR interface, but not from the CLI, as I need it to be automated.
You can use aws s3 cp to copy files from an S3 bucket to your local file system. Use the following command: $ aws s3 cp s3://bucket/folder/file.txt .
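If you need a whole folder rather than a single file, the same command should also work with the --recursive flag (the bucket and destination path below are just placeholders):
$ aws s3 cp s3://bucket/folder/ /home/hadoop/input/ --recursive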
In the Amazon S3 console, choose your S3 bucket, choose the file that you want to open or download, choose Actions, and then choose Open or Download. If you are downloading an object, specify where you want to save it.
The CLI that comes installed on EMR is aws <servicename> <function>:
aws s3 cp s3://bucket/path/to/remote/file.sh /local/path/to/file.sh
https://aws.amazon.com/cli/
As for automating that, it's certainly reasonable to put your commands into a custom step where the "path" to the command is simply "command-runner.jar" and the args of the step are the command itself.
So, ultimately, CLI code can do the same thing:
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Name="Command Runner",Jar="command-runner.jar",Args=["spark-submit","Args..."]
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-commandrunner.html
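Applied to your copy, a step along these lines should do it (the cluster ID, bucket, and paths are placeholders; command-runner.jar just executes whatever command is given in Args on the master node):
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps Type=CUSTOM_JAR,Name="Copy from S3",ActionOnFailure=CONTINUE,Jar=command-runner.jar,Args=["aws","s3","cp","s3://bucket/path/to/remote/file.sh","/local/path/to/file.sh"]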
Thanks to the previous answers. I was stuck, but was able to build this to use s3-dist-cp to copy from S3 to EMR:
aws emr add-steps --profile <> --cluster-id <> --steps Type=CUSTOM_JAR,Name=UPLOAD_JAR_CONFIG,ActionOnFailure=CANCEL_AND_WAIT,Jar=command-runner.jar,Args=[s3-dist-cp,--src,s3a://<>/,--dest,hdfs:///<>/<>/,--srcPattern=.*.*]
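For a quick manual check, the same tool can also be run directly on the master node over SSH, assuming your EMR release (4.x or later) ships the s3-dist-cp command on the PATH (bucket and destination are placeholders):
s3-dist-cp --src s3://bucket/logs/ --dest hdfs:///output/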