
Apache Spark shell crashes when trying to start executor on worker

Background

I have been battling with Apache Spark and have worked out every error except one. I have one master and one slave. I can start the master via

./sbin/start-master.sh

and then I can connect to it from the slave by

JAVA_OPTS="-Xmx10g" ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://10.17.16.43:7077

I then see the success message

14/08/25 08:47:04 INFO worker.Worker: Successfully registered with master spark://10.17.16.43:7077

Everything described here is repeatable (I have been at this for a while). I can telnet into the master from the slave just fine, as most other tutorials suggest checking. SSH between master and slave is configured with RSA keys so that no passwords are needed, as recommended elsewhere.

I have spark/conf/spark-env.sh set to the following. There are more lines in the file, but they are commented out.

export SPARK_DAEMON_JAVA_OPTS+=" -Dspark.local.dir=/mnt/spark,/mnt2/spark -Dspark.akka.logLifecycleEvents=true"
export SPARK_LOCAL_IP=`ifconfig | sed -En 's/127.0.0.1//;s/.*inet (addr:)?(([0-9]*\.){3}[0-9]*).*/\2/p' | head -1`
export SPARK_MASTER_IP=$SPARK_LOCAL_IP
export SPARK_MASTER_WEBUI_PORT=8090
export SPARK_WORKER_CORES=1

I pulled those from various tutorials in the hope that they would fix something.

Here is the master's /etc/hosts

127.0.0.1       localhost
10.17.16.43     aidan-workstation
10.17.16.49     ubuntu

And the slave's

127.0.0.1   localhost
10.17.16.49 ubuntu
10.17.16.43 aidan-workstation

The Error

When I run ./bin/spark-shell

I get the following in the master terminal (I have posted just the tail end; the full output is here)

14/08/25 08:58:25 INFO client.AppClient$ClientActor: Executor added: app-20140825085822-0002/8 on worker-20140825084704-ubuntu-49237 (ubuntu:49237) with 8 cores
14/08/25 08:58:25 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140825085822-0002/8 on hostPort ubuntu:49237 with 8 cores, 512.0 MB RAM
14/08/25 08:58:25 INFO client.AppClient$ClientActor: Executor updated: app-20140825085822-0002/8 is now RUNNING
14/08/25 08:58:25 INFO client.AppClient$ClientActor: Executor updated: app-20140825085822-0002/8 is now FAILED (Command exited with code 1)
14/08/25 08:58:25 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140825085822-0002/8 removed: Command exited with code 1
14/08/25 08:58:25 INFO client.AppClient$ClientActor: Executor added: app-20140825085822-0002/9 on worker-20140825084704-ubuntu-49237 (ubuntu:49237) with 8 cores
14/08/25 08:58:25 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20140825085822-0002/9 on hostPort ubuntu:49237 with 8 cores, 512.0 MB RAM
14/08/25 08:58:25 INFO client.AppClient$ClientActor: Executor updated: app-20140825085822-0002/9 is now RUNNING
14/08/25 08:58:25 INFO client.AppClient$ClientActor: Executor updated: app-20140825085822-0002/9 is now FAILED (Command exited with code 1)
14/08/25 08:58:25 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140825085822-0002/9 removed: Command exited with code 1
14/08/25 08:58:25 ERROR client.AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/08/25 08:58:25 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...

At the same time, the slave outputs the following (again just the tail; the full output is here as well)

14/08/25 09:04:18 INFO worker.ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" ":/home/hduser/spark/conf:/home/hduser/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.2-hadoop2.2.0.jar:/home/hduser/hadoop/etc/hadoop" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@aidan-workstation:60456/user/CoarseGrainedScheduler" "7" "ubuntu" "8" "akka.tcp://sparkWorker@ubuntu:55553/user/Worker" "app-20140825090434-0003"
14/08/25 09:04:18 INFO worker.Worker: Executor app-20140825090434-0003/7 finished with state FAILED message Command exited with code 1 exitStatus 1
14/08/25 09:04:18 INFO worker.Worker: Asked to launch executor app-20140825090434-0003/8 for Spark shell
14/08/25 09:04:18 INFO worker.ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" ":/home/hduser/spark/conf:/home/hduser/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.2-hadoop2.2.0.jar:/home/hduser/hadoop/etc/hadoop" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@aidan-workstation:60456/user/CoarseGrainedScheduler" "8" "ubuntu" "8" "akka.tcp://sparkWorker@ubuntu:55553/user/Worker" "app-20140825090434-0003"
14/08/25 09:04:19 INFO worker.Worker: Executor app-20140825090434-0003/8 finished with state FAILED message Command exited with code 1 exitStatus 1
14/08/25 09:04:19 INFO worker.Worker: Asked to launch executor app-20140825090434-0003/9 for Spark shell
14/08/25 09:04:19 INFO worker.ExecutorRunner: Launch command: "/usr/lib/jvm/java-8-oracle/bin/java" "-cp" ":/home/hduser/spark/conf:/home/hduser/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.2-hadoop2.2.0.jar:/home/hduser/hadoop/etc/hadoop" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://spark@aidan-workstation:60456/user/CoarseGrainedScheduler" "9" "ubuntu" "8" "akka.tcp://sparkWorker@ubuntu:55553/user/Worker" "app-20140825090434-0003"
14/08/25 09:04:19 INFO worker.Worker: Executor app-20140825090434-0003/9 finished with state FAILED message Command exited with code 1 exitStatus 1

You may notice that the timestamps in the two logs do not line up. That is my doing: I had to rerun the programs at different times to get clean output. It is not caused by Spark.

What I want

How can I connect my master and slave such that I can run Scala programs on a distributed system?

asked Aug 25 '14 16:08 by ignorance




1 Answer

I note from your logs that Akka is using the simple hostname aidan-workstation rather than a fully qualified domain name such as aidan-workstation.acme.com:

akka.tcp://spark@aidan-workstation:60456/user/CoarseGrainedScheduler
akka.tcp://sparkWorker@ubuntu:55553/user/Worker

According to this user post, that may be the issue you're having:

I had to set SPARK_MASTER_IP in conf/start-master.sh to hostname -f instead of hostname, since akka seems not to work properly with host names / ip, it requires fully qualified domain names.
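
As a rough sketch of what that quoted change might look like: the snippet below assumes the setting lives in conf/spark-env.sh (where the question already exports SPARK_MASTER_IP), and is_fqdn is a hypothetical helper introduced only for illustration. It is not the poster's exact change.

```shell
# Sketch for conf/spark-env.sh (assumption: this file is where SPARK_MASTER_IP is set).

# Hypothetical helper: succeeds if the name contains a dot.
# This is a crude check; a bare IP address such as 10.17.16.43 also passes.
is_fqdn() {
  case "$1" in
    *.*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Ask for the fully qualified name; fall back to the short name if that fails.
fqdn="$(hostname -f 2>/dev/null || hostname)"

if is_fqdn "$fqdn"; then
  export SPARK_MASTER_IP="$fqdn"
else
  echo "warning: '$fqdn' is not fully qualified; check /etc/hosts" >&2
fi
```

If the warning is printed, the machine has no fully qualified name configured, which is the situation the hosts-file suggestion is meant to address.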

You can try editing your hosts file to include a fake domain name.
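
For example, the master's /etc/hosts from the question might become something like the following. The acme.com domain is made up purely for illustration, and the same entries would presumably need to appear on the slave as well. The fully qualified name is listed first so that it is treated as the canonical name, with the short name kept as an alias.

```
127.0.0.1       localhost
10.17.16.43     aidan-workstation.acme.com   aidan-workstation
10.17.16.49     ubuntu.acme.com              ubuntu
```

After a change like this, running hostname -f on each machine should print the fully qualified name, which is a quick way to check that the edit took effect.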

answered Sep 28 '22 03:09 by Brad