 

How to run Python Spark code on Amazon AWS?

I have written some Python code in Spark and I want to run it on Amazon's Elastic MapReduce (EMR).

My code works great on my local machine, but I am slightly confused about how to run it on Amazon's AWS.

More specifically, how should I transfer my Python code over to the master node? Do I need to copy my Python code to my S3 bucket and execute it from there? Or should I SSH into the master node and scp my Python code into the Spark folder there?

For now, I tried running the code locally from my terminal and connecting to the cluster address (I pieced this together from the output of spark-submit's --help flag, so I might be missing a few steps here):

./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 \
--master spark://[email protected] \
mypythoncode.py

I tried it with and without my permissions file, i.e.

-i permissionsfile.pem

However, it fails, and the stack trace shows something along the lines of:

Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
    at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    ......
    ......

Is my approach correct, and do I just need to resolve the access issues to get going, or am I heading in the wrong direction?

What is the right way of doing it?

I searched a lot on YouTube but couldn't find any tutorials on running Spark on Amazon's EMR.

If it helps, the dataset I am working on is part of Amazon's public datasets.

Piyush asked Nov 05 '16


2 Answers

  1. Go to EMR and create a new cluster. [Recommendation: start with 1 node only, just for testing purposes.]
  2. Click the checkbox to install Spark; you can uncheck the other boxes if you don't need those additional applications.
  3. Configure the cluster further by choosing a VPC and a security key (SSH key, a.k.a. pem key).
  4. Wait for it to boot up. Once your cluster's status says "Waiting", you're free to proceed.
  5. [Spark submission via the GUI] In the GUI you can add a Step, select the Spark application type, upload your Spark file to S3, and then choose the path to that newly uploaded S3 file. Once it runs, it will either succeed or fail. If it fails, wait a moment and then click "View logs" next to that step in the list of steps. Keep tweaking your script until you've got it working. (A CLI equivalent of this step is sketched after the note below.)

    [Submission via the command line] SSH into the master (driver) node following the SSH instructions at the top of the cluster page. Once inside, use a command-line text editor to create a new file and paste in the contents of your script. Then run spark-submit yourNewFile.py. If it fails, you'll see the error output straight in the console. Tweak your script and re-run, until you've got it working as expected. (See the sketch just below.)
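
As a concrete illustration of the command-line route, here is a minimal sketch; the key file and script name are taken from the question, while the master address is a placeholder:

    # On your local machine: copy the script to the master node
    # (the master's public DNS is shown on the EMR cluster page).
    scp -i permissionsfile.pem mypythoncode.py hadoop@<master-public-dns>:~

    # On the master node, after SSH-ing in with the same key:
    spark-submit mypythoncode.py

On EMR the default SSH user is typically hadoop, spark-submit is already on the PATH, and the cluster is preconfigured to use YARN, so no --master flag should be needed.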

Note: submitting jobs from your local machine to a remote cluster is troublesome, because you may actually be making your local instance of Spark responsible for some expensive computation and data transfer over the network. That's why you want to submit AWS EMR jobs from within EMR.
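
If you prefer not to click through the console for the GUI step above, the AWS CLI can add the same kind of Spark step; this is only a sketch, with a placeholder cluster ID and S3 path:

    # Submit mypythoncode.py (already uploaded to S3) as a Spark step.
    # The Args list is handed to spark-submit on the cluster.
    aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
      --steps Type=Spark,Name=MyPySparkJob,ActionOnFailure=CONTINUE,Args=[s3://my-bucket/mypythoncode.py]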

Kristian answered Oct 14 '22


There are typically two ways to run a job on an Amazon EMR cluster (whether for Spark or other job types):

  • Log in to the master node and run Spark jobs interactively (see the sketch after this list). See: Access the Spark Shell
  • Submit jobs to the EMR cluster. See: Adding a Spark Step
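
For the first option, an interactive session on the master node might look like this sketch (assuming Spark 2.x, where the shell pre-creates a spark session; the S3 path is a placeholder):

    # After SSH-ing into the master node, start an interactive PySpark shell:
    pyspark

    # Inside the shell, read data directly from S3 and inspect it:
    >>> df = spark.read.csv("s3://my-bucket/some-data.csv", header=True)
    >>> df.count()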

If you have Apache Zeppelin installed on your EMR cluster, you can use a web browser to interact with Spark.

The error you are experiencing says that files were accessed via the s3n: protocol, which requires AWS credentials to be provided. If, instead, the files were accessed via s3:, I suspect that the credentials would be sourced from the IAM role that is automatically assigned to nodes in the cluster, and this error would be resolved.
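
For example, the s3n credentials can be supplied at submit time through Spark's pass-through Hadoop properties (placeholder keys below), although simply switching the paths in your code from s3n:// to s3:// and letting the cluster's IAM role authenticate is usually the cleaner fix:

    spark-submit \
      --conf spark.hadoop.fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY_ID \
      --conf spark.hadoop.fs.s3n.awsSecretAccessKey=YOUR_SECRET_ACCESS_KEY \
      mypythoncode.py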

John Rotenstein answered Oct 14 '22