
How can I run an Apache Spark shell remotely?


I have a Spark cluster setup with one master and 3 workers. I also have Spark installed on a CentOS VM. I'm trying to run a Spark shell from my local VM which would connect to the master, and allow me to execute simple Scala code. So, here is the command I run on my local VM:

bin/spark-shell --master spark://spark01:7077

The shell runs to the point where I can enter Scala code. It says that executors have been granted (x3 - one for each worker). If I peek at the Master's UI, I can see one running application, Spark shell. All the workers are ALIVE, have 2 / 2 cores used, and have allocated 512 MB (out of 5 GB) to the application. So, I try to execute the following Scala code:

sc.parallelize(1 to 100).count    

Unfortunately, the command doesn't work. The shell will just print the same warning endlessly:

INFO SparkContext: Starting job: count at <console>:13
INFO DAGScheduler: Got job 0 (count at <console>:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 0(count at <console>:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13), which has no missing parents
INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13)
INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

Following my research into the issue, I have confirmed that the master URL I am using is identical to the one on the web UI. I can ping and ssh both ways (cluster to local VM, and vice-versa). Moreover, I have played with the executor-memory parameter (both increasing and decreasing the memory) to no avail. Finally, I tried disabling the firewall (iptables) on both sides, but I keep getting the same error. I am using Spark 1.0.2.
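
(For reference, this is the kind of invocation I mean; 512m is just an example value:)

# example only: same master URL as above, with the executor memory flag I tried tuning
bin/spark-shell --master spark://spark01:7077 --executor-memory 512m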

TL;DR: Is it possible to run an Apache Spark shell remotely (and, by extension, submit applications remotely)? If so, what am I missing?

EDIT: I took a look at the worker logs and found that the workers had trouble finding Spark:

ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor
java.io.IOException: Cannot run program "/usr/bin/spark-1.0.2/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory
...

Spark is installed in a different directory on my local VM than on the cluster. The path the worker is attempting to find is the one on my local VM. Is there a way for me to specify this path? Or must they be identical everywhere?
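
(One possible workaround, sketched with a hypothetical worker-side install location, is to symlink the driver's path on each worker:)

# hypothetical: /opt/spark-1.0.2 stands in for wherever Spark actually lives on the worker
sudo ln -s /opt/spark-1.0.2 /usr/bin/spark-1.0.2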

For the moment, I adjusted my directories to circumvent this error. Now, my Spark Shell fails before I get the chance to enter the count command (Master removed our application: FAILED). All the workers have the same error:

ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@spark02:7078] -> [akka.tcp://sparkExecutor@spark02:53633]:
Error [Association failed with [akka.tcp://sparkExecutor@spark02:53633]] 
[akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@spark02:53633] 
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: spark02/192.168.64.2:53633

As suspected, I am running into network issues. What should I look at now?
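
(One quick check, using the host and port straight from the error above; note this executor port is ephemeral and changes on every launch:)

# run on spark02: see whether anything is listening on the port the worker is trying to reach
nc -vz 192.168.64.2 53633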

Nicolas, asked Oct 31 '14


People also ask

How do I submit Spark jobs remotely?

Requirements for running jobs remotely using spark-submit: provide network access from the remote host to all Data Proc cluster hosts; install Hadoop and Spark packages on the remote host, making sure their versions match those on the Data Proc cluster hosts; and prepare the Hadoop and Spark configuration files.
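
(As a rough sketch, assuming a standalone master at spark01:7077 and a made-up application class and jar, a remote submission might look like this:)

# hypothetical class name and jar path
bin/spark-submit \
  --master spark://spark01:7077 \
  --class com.example.MyApp \
  target/my-app.jar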

Where can I run Spark shell?

Go to the Apache Spark installation directory on the command line, type bin/spark-shell, and press Enter. This launches the Spark shell and gives you a Scala prompt for interacting with Spark in the Scala language.

How do I run Spark standalone?

To install Spark in standalone mode, you simply place a compiled version of Spark on each node of the cluster. You can obtain a pre-built version of Spark with each release or build it yourself.
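
(A minimal sketch of starting the standalone daemons, assuming passwordless SSH from the master and the worker hostnames listed in conf/slaves:)

# on the master host
sbin/start-master.sh

# still on the master: start a worker on every host listed in conf/slaves
sbin/start-slaves.sh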

How do I connect my Spark shell?

You can access the Spark shell by connecting to the master node with SSH and invoking spark-shell . For more information about connecting to the master node, see Connect to the master node using SSH in the Amazon EMR Management Guide. The following examples use Apache HTTP Server access logs stored in Amazon S3.
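
(Sketch, with a made-up user name; spark01 is the master from the question above:)

# connect to the master node, then launch the shell locally there
ssh user@spark01
bin/spark-shell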


2 Answers

I solved this problem on my Spark client and Spark cluster.

Check your network: client A and the cluster must be able to ping each other. Then add these two config lines to spark-env.sh on client A.

First:

export SPARK_MASTER_IP=172.100.102.156  
export SPARK_JAR=/usr/spark-1.1.0-bin-hadoop2.4/lib/spark-assembly-1.1.0-hadoop2.4.0.jar

Second:

Test your Spark shell in cluster mode!

Rocketeer, answered Oct 05 '22


This problem can be caused by the network configuration. It looks like the error TaskSchedulerImpl: Initial job has not accepted any resources can have quite a few causes (see also this answer):

  • actual resource shortage
  • broken communication between master and workers
  • broken communication between master/workers and driver

The easiest way to exclude the first two possibilities is to run a test with a Spark shell directly on the master. If this works, communication within the cluster itself is fine and the problem lies in the communication with the driver host. To analyze the problem further, it helps to look into the worker logs, which contain entries like

16/08/14 09:21:52 INFO ExecutorRunner: Launch command: 
    "/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java" 
    ... 
    "--driver-url" "spark://[email protected]:37752"  
    ...

and test whether the worker can establish a connection to the driver's IP/port. Apart from general firewall / port forwarding issues, it might be possible that the driver is binding to the wrong network interface. In this case you can export SPARK_LOCAL_IP on the driver before starting the Spark shell in order to bind to a different interface.
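
For example (the driver address below is a placeholder; the port comes from the log entry above):

# run on a worker: check that the driver's port is reachable
nc -vz <driver-ip> 37752

# run on the driver host: bind to the interface the workers can reach, then start the shell
export SPARK_LOCAL_IP=<driver-ip>
bin/spark-shell --master spark://spark01:7077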

Some additional references:

  • Knowledge base entry on network connectivity issues.
  • Github discussion on improving the documentation of Initial job has not accepted any resources.

bluenote10, answered Oct 05 '22