
Cannot do simple task on ec2 spark cluster from local pyspark

I am trying to run pyspark from my Mac to do computations on an EC2 Spark cluster.
If I log in to the cluster, it works as expected:

$ ec2/spark-ec2 -i ~/.ec2/spark.pem -k spark login test-cluster2
$ spark/bin/pyspark

Then do a simple task

>>> data = sc.parallelize(range(1000), 10)
>>> data.count()

Works as expected:

14/06/26 16:38:52 INFO spark.SparkContext: Starting job: count at <stdin>:1
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Got job 0 (count at <stdin>:1) with 10 output partitions (allowLocal=false)
14/06/26 16:38:52 INFO scheduler.DAGScheduler: Final stage: Stage 0 (count at <stdin>:1)
...
14/06/26 16:38:53 INFO spark.SparkContext: Job finished: count at <stdin>:1, took 1.195232619 s
1000

But if I try the same thing from my local machine,

$ MASTER=spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077 bin/pyspark

it can't seem to connect to the cluster:

14/06/26 09:45:43 INFO AppClient$ClientActor: Connecting to master spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077...
14/06/26 09:45:47 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
...
  File "/Users/anthony1/git/incubator-spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.collect.
: org.apache.spark.SparkException: Job aborted: Spark cluster looks down
14/06/26 09:53:17 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

I thought the problem was with the EC2 security settings, but adding inbound rules to both the master and slave security groups to accept all ports did not help.

Any help will be greatly appreciated!

Others are asking the same question on the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/Deploying-a-python-code-on-a-spark-EC2-cluster-td4758.html#a8465

asked Jun 26 '14 by Anthony


People also ask

How do I submit a PySpark job in cluster mode?

You can submit a Spark batch application in cluster mode (the default) or client mode, either from inside the cluster or from an external client. In cluster mode, the driver runs on a host in your driver resource group; the spark-submit syntax is --deploy-mode cluster.
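
A minimal sketch of the two submission modes (the master URL and my_job.py are placeholders; note that the standalone clusters spark-ec2 builds only support client deploy mode for Python applications):

$ bin/spark-submit --master spark://<master-host>:7077 --deploy-mode cluster my_job.py   # driver runs inside the cluster
$ bin/spark-submit --master spark://<master-host>:7077 --deploy-mode client my_job.py    # driver runs on the submitting machine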

How do I run a spark application from an ec2 instance?

Running applications: Go into the ec2 directory in the release of Spark you downloaded. Run ./spark-ec2 -k <keypair> -i <key-file> login <cluster-name> to SSH into the cluster, where <keypair> and <key-file> are as above. (This is just for convenience; you could also use the EC2 console.)

In which situation should you not run spark on an EMR cluster?

Avoid large shuffles in Spark: to reduce the amount of data that Spark needs to reprocess if a Spot Instance is interrupted in your Amazon EMR cluster, you should avoid large shuffles. Wide-dependency operations like groupBy and some types of joins can produce vast amounts of intermediate data.
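
As a small illustration (not from the question), the usual trick in PySpark is to aggregate before the shuffle, e.g. preferring reduceByKey over groupByKey, so only partial results cross the network:

>>> pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
>>> # groupByKey() shuffles every individual value before summing:
>>> pairs.groupByKey().mapValues(sum).collect()
>>> # reduceByKey() sums within each partition first, so far less data is shuffled:
>>> pairs.reduceByKey(lambda x, y: x + y).collect()

Both return the per-key sums, but the second shuffles only one partial sum per key per partition.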


2 Answers

The spark-ec2 script configures the Spark cluster on EC2 as a standalone cluster, which means it cannot work with remote submissions. I struggled with the same error you describe for days before figuring out that it is not supported. Unfortunately, the error message is misleading.

So you have to copy your code to the master and log in there to execute your Spark task.
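
A sketch of that workflow, reusing the key file and master hostname from the question (my_job.py is a placeholder, and the root login is what spark-ec2 sets up by default); if the standalone master rejects the connection, use the master URL exactly as shown at the top of its web UI on port 8080:

$ scp -i ~/.ec2/spark.pem my_job.py root@ec2-54-234-204-13.compute-1.amazonaws.com:~
$ ssh -i ~/.ec2/spark.pem root@ec2-54-234-204-13.compute-1.amazonaws.com
$ spark/bin/spark-submit --master spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077 my_job.py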

answered Oct 24 '22 by Felix


In my experience, Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory usually means you have accidentally set the number of cores too high, or set the executor memory too high, i.e. higher than what your nodes actually have.
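
For instance, when submitting against the standalone master you can cap the per-executor memory and the total cores so they fit within what the workers advertise in the master UI (2g, 4, and my_job.py below are placeholders):

$ bin/spark-submit \
    --master spark://ec2-54-234-204-13.compute-1.amazonaws.com:7077 \
    --executor-memory 2g \
    --total-executor-cores 4 \
    my_job.py

The equivalent configuration properties are spark.executor.memory and spark.cores.max.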

Other, less likely, causes could be that you got the URI wrong and you are not really connecting to the master. I also once saw this problem when the /run partition was 100% full.

Even less likely, your cluster may actually be down, and you need to restart your Spark workers.
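
Quick checks for those last two causes (a sketch; it assumes the standard standalone scripts under spark/sbin on the master and a conf/slaves file listing the workers):

$ df -h /run                # make sure the partition is not 100% full
$ spark/sbin/stop-all.sh    # stop the standalone master and all workers
$ spark/sbin/start-all.sh   # bring them back up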

answered Oct 24 '22 by samthebest