Spark-shell connecting to Mesos stuck at sched.cpp

Below are my spark-defaults.conf and the output of spark-shell

$ cat conf/spark-defaults.conf
spark.master                     mesos://172.16.**.***:5050
spark.eventLog.enabled           false
spark.broadcast.compress         false
spark.driver.memory              4g
spark.executor.memory            4g
spark.executor.instances         1

$ bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_80)
Type in expressions to have them evaluated.
Type :help for more information.
15/11/15 04:56:11 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
I1115 04:56:12.171797 72994816 sched.cpp:164] Version: 0.25.0
I1115 04:56:12.173741 67641344 sched.cpp:262] New master detected at [email protected].**.***:5050
I1115 04:56:12.173951 67641344 sched.cpp:272] No credentials provided. Attempting to register without authentication

It hangs here indefinitely, while the Mesos Web UI shows a lot of Spark frameworks spinning up, continuously registering and unregistering, until I quit spark-shell with Ctrl-C.

[Screenshot: Mesos Web UI]

I suspect this is partly caused by my laptop having multiple IP addresses. When run on a server, it continues past this point to the usual Scala REPL:

I1116 09:53:30.265967 29327 sched.cpp:641] Framework registered with 9d725348-931a-48fb-96f7-d29a4b09f3e8-0242
15/11/16 09:53:30 INFO mesos.MesosSchedulerBackend: Registered as framework ID 9d725348-931a-48fb-96f7-d29a4b09f3e8-0242
15/11/16 09:53:30 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57810.
15/11/16 09:53:30 INFO netty.NettyBlockTransferService: Server created on 57810
15/11/16 09:53:30 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/11/16 09:53:30 INFO storage.BlockManagerMasterEndpoint: Registering block manager 172.16.**.***:57810 with 2.1 GB RAM, BlockManagerId(driver, 172.16.**.***, 57810)
15/11/16 09:53:30 INFO storage.BlockManagerMaster: Registered BlockManager
15/11/16 09:53:30 INFO repl.Main: Created spark context..
Spark context available as sc.

I'm running Mesos 0.25.0 built by Mesosphere, and I'm setting spark.driver.host to an address that is accessible from all machines in the Mesos cluster. I see that every port opened by the spark-shell process is bound either to that IP address or to *.
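
For reference, this is roughly the kind of check I ran to see which addresses the spark-shell process binds to (the lsof invocation below is just an illustration, not the exact command I used):

$ lsof -nP -iTCP -sTCP:LISTEN | grep java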

The most similar question on Stack Overflow doesn't seem to help, because in my case the laptop should be accessible from the Mesos hosts.

I couldn't locate the log files that might explain why the frameworks were unregistered. Where should I look to resolve this issue?

asked Nov 16 '15 by lyomi

1 Answer

Mesos has a very odd notion of how networking works -- in particular, it is important that you can establish bidirectional communication between the Master and the Framework, so both sides need a network route to each other. If you have ever run behind NAT or inside containers, you have run into this before -- usually you need to set LIBPROCESS_IP to your publicly accessible IP on the Framework side. This likely applies to multihomed systems as well, like your laptop.
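
For example, assuming 172.16.**.*** is the address on your laptop that the Mesos machines can reach (the same one you use in spark-defaults.conf), a sketch of that would be to export the variable before launching the shell; you could also set it in conf/spark-env.sh:

$ export LIBPROCESS_IP=172.16.**.***
$ bin/spark-shell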

You can find a bit more information lying around the internet, although it's unfortunately not well documented. There's a hint on their Deployment Scripts page though.

answered by Steven Schlansker