
Spark 2.0 Standalone mode Dynamic Resource Allocation Worker Launch Error

I'm running Spark 2.0 in Standalone mode. I successfully configured it to launch on a server and also configured the IPython/PySpark kernel as an option in Jupyter Notebook. Everything works, but I'm facing the problem that every Notebook I launch gets all 4 of my workers assigned to its application. So if another person from my team tries to launch another Notebook with the PySpark kernel, it simply does not work until I stop the first notebook and release all the workers.
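
(For reference, the only static workaround I know of is to cap each application's cores with spark.cores.max, but I'd rather share the workers dynamically. A minimal sketch of that cap, using the master from the logs below and a placeholder app name:)

import pyspark

# Hypothetical static cap: limit this notebook's application to 2 cores so the
# remaining workers stay free for other applications (app name is a placeholder).
conf = pyspark.SparkConf() \
    .setMaster("spark://cerberus:7077") \
    .setAppName("notebook-1") \
    .set("spark.cores.max", "2")
sc = pyspark.SparkContext(conf=conf)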

To solve this problem I'm trying to follow the instructions on dynamic resource allocation from the Spark 2.0 documentation. So, in $SPARK_HOME/conf/spark-defaults.conf I have the following lines:

spark.dynamicAllocation.enabled    true
spark.shuffle.service.enabled      true
spark.dynamicAllocation.executorIdleTimeout    10

Also, on $SPARK_HOME/conf/spark-env.sh I have:

export SPARK_WORKER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=1

But when I try to launch the workers using $SPARK_HOME/sbin/start-slaves.sh, only the first worker launches successfully. Its log ends like this:

16/11/24 13:32:06 INFO Worker: Successfully registered with master spark://cerberus:7077

But the logs from workers 2-4 show this error:

INFO ExternalShuffleService: Starting shuffle service on port 7337 with useSasl = false
16/11/24 13:32:08 ERROR Inbox: Ignoring error
java.net.BindException: Address already in use

It seems (to me) that the first worker successfully starts the shuffle service on port 7337, but workers 2-4 "do not know" about it and try to launch another shuffle service on the same port.

The problem also occurs for all workers (1-4) if I first launch a shuffle service (using $SPARK_HOME/sbin/start-shuffle-service.sh) and then try to launch all the workers ($SPARK_HOME/sbin/start-slaves.sh).

Is there any option to get around this? That is, can all workers verify whether a shuffle service is already running and connect to it instead of trying to create a new one?

Eilliar asked Nov 09 '22 05:11

1 Answer

I had the same issue and seemed to get it working by removing the spark.shuffle.service.enabled item from the config file (in fact I don't have any dynamicAllocation-related items in there) and instead putting this in the SparkConf when I request a SparkContext:

import pyspark

# Enable dynamic allocation per application instead of in spark-defaults.conf
sconf = pyspark.SparkConf() \
    .setAppName("sc1") \
    .set("spark.dynamicAllocation.enabled", "true") \
    .set("spark.shuffle.service.enabled", "true")
sc1 = pyspark.SparkContext(conf=sconf)
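
If it helps, the same SparkConf can also bound how many executors each notebook may claim; the bounds below are my own addition (standard spark.dynamicAllocation settings), not something this setup strictly needs. For example, in a second notebook:

# Same idea, but this application can never hold more than 2 executors
sconf2 = pyspark.SparkConf() \
    .setAppName("sc2") \
    .set("spark.dynamicAllocation.enabled", "true") \
    .set("spark.shuffle.service.enabled", "true") \
    .set("spark.dynamicAllocation.minExecutors", "1") \
    .set("spark.dynamicAllocation.maxExecutors", "2")
sc2 = pyspark.SparkContext(conf=sconf2)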

I start the master & slaves as normal:

$SPARK_HOME/sbin/start-all.sh

And I have to start one instance of the shuffle service:

$SPARK_HOME/sbin/start-shuffle-service.sh

Then I started two notebooks with this configuration and got them both to run a small job. The first notebook's application does the job and is in the RUNNING state; the second notebook's application is in the WAITING state. After a minute (the default idle timeout), the resources get reallocated and the second context gets to do its job (and both end up in the RUNNING state).
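
Any trivial job is enough to exercise the allocation; a minimal example (not the exact code I ran) would be:

# Small job: sum of squares over a tiny RDD, just enough to request executors
sc1.parallelize(range(1000)).map(lambda x: x * x).sum()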

Hope this helps, John

John Leonard answered Nov 15 '22 11:11