
How to Launch Spark 2.0 on EC2

With the release of Spark 2.0 today, native support for launching a Spark EC2 cluster on AWS has been removed:

https://spark.apache.org/releases/spark-release-2-0-0.html#removals-behavior-changes-and-deprecations

Spark EC2 script has been fully moved to an external repository hosted by the UC Berkeley AMPLab

The AMPLab GitHub page includes these instructions:

https://github.com/amplab/spark-ec2/tree/branch-2.0#launching-a-cluster

Go into the ec2 directory in the release of Apache Spark you downloaded.

The problem is that there is no ec2 folder in the 2.0 download. Does anyone know how I can launch a Spark 2.0 cluster on EC2?

Thanks in advance.

Frank B. asked Jul 27 '16

1 Answer

LAST EDIT

For anyone having this issue, the answer is simpler: here.

EDIT 2

I realized after the first edit that the process is slightly more convoluted, so here's a new edit for anyone who might find it useful in the future.

The issue is that Spark no longer provides the ec2 directory as part of the official distribution, which is a problem if you're used to spinning up your standalone clusters this way.

The solution is simple:

  1. Download the official ec2 directory as detailed in the Spark 2.0.0 documentation.
  2. If you just copy the dir into your Spark 2.0.0 directory and run the spark-ec2 executable to mimic the way things worked in Spark 1.*, you will be able to spin up your cluster as usual. But when you ssh into it, you'll realize that none of the binaries are there anymore.
  3. So, once you spin up your cluster (as you normally would with the spark-ec2 you downloaded in step 1), you have to rsync your local directory containing Spark 2.0.0 onto the master of your newly created cluster. Once that is done, you can spark-submit jobs as you normally do. A sketch of steps 2 and 3 follows this list.

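Here's a rough sketch of steps 2 and 3; the key pair, identity file, cluster name, and master hostname are placeholders of mine, not anything from the Spark docs:

    # Step 2: launch the cluster with the spark-ec2 script from step 1
    ./spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem -s 2 launch my-cluster

    # Step 3: push your local Spark 2.0.0 build onto the master, since the
    # cluster AMIs no longer ship the binaries. spark-ec2 clusters log in
    # as root, with Spark living under /root/spark on the master.
    rsync -avz -e "ssh -i $HOME/.ssh/my-keypair.pem" \
        ./spark-2.0.0-bin-hadoop2.7/ \
        root@<master-public-dns>:/root/spark/

    # Then ssh into the master and spark-submit as usual

The master's public DNS is printed at the end of the launch output and is also visible in the AWS console.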
Really simple, but it seems to me the Spark docs could be clearer about this for all of us normies.


EDIT: This was in fact the right thing to do. For anyone with the same question: download the ec2 dir from AMPLab as Spark suggests, put that folder inside your local Spark 2.0.0 dir, and fire up the scripts as usual. Apparently they only decoupled the directory for maintenance purposes, but the logic is still the same. It would be nice to have a few words about this in the Spark docs.
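For concreteness, here's a minimal sketch of that setup, assuming the branch-2.0 branch of the AMPLab repo linked in the question (the directory names are just examples):

    cd spark-2.0.0-bin-hadoop2.7    # your local Spark 2.0.0 dir
    git clone -b branch-2.0 https://github.com/amplab/spark-ec2.git ec2
    ./ec2/spark-ec2 --help          # the scripts now run as they did in 1.x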


I tried the following: I cloned the spark-ec2 branch-1.6 directory from the AMPLab link into my spark-2.0.0 directory and attempted to launch a cluster with the usual ./ec2/spark-ec2 command (sketched below). Maybe that's what they want us to do?
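That launch attempt looked roughly like this; the key pair, identity file, and cluster name are placeholders:

    # 16-node cluster, as described below
    ./ec2/spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem -s 16 launch my-cluster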

I'm launching a small 16-node cluster. I can see it in the AWS dashboard, but the terminal has been stuck printing the usual SSH error for the past... almost two hours.

    Warning: SSH connection error. (This could be temporary.)
    Host: ec2-54-165-25-18.compute-1.amazonaws.com
    SSH return code: 255
    SSH output: ssh: connect to host ec2-54-165-25-18.compute-1.amazonaws.com port 22: Connection refused

Will update if I find anything useful.

xv70 answered Oct 16 '22