Spark atop Docker not accepting jobs

I'm trying to make a hello-world example work with Spark + Docker; here is my code.

import org.apache.spark.SparkContext

object Generic {
  def main(args: Array[String]) {
    // Standalone master running in a Docker container
    val sc = new SparkContext("spark://172.17.0.3:7077", "Generic", "/opt/spark-0.9.0")

    // Monte Carlo estimate of Pi
    val NUM_SAMPLES = 100000
    val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
      val x = Math.random * 2 - 1
      val y = Math.random * 2 - 1
      if (x * x + y * y < 1) 1.0 else 0.0
    }.reduce(_ + _)

    println("Pi is roughly " + 4 * count / NUM_SAMPLES)
  }
}

When I run sbt run, I get

14/05/28 15:19:58 INFO client.AppClient$ClientActor: Connecting to master spark://172.17.0.3:7077...
14/05/28 15:20:08 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

I checked both the cluster UI, where I have 3 nodes each with 1.5 GB of memory, and the namenode UI, where I see the same thing.

The Docker logs show no output from the workers, and the following from the master:

14/05/28 21:20:38 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@master:7077] -> [akka.tcp://spark@10.0.3.1:48085]: Error [Association failed with [akka.tcp://spark@10.0.3.1:48085]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@10.0.3.1:48085]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /10.0.3.1:48085

]

This happens a couple of times, and then the program times out and dies with:

[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Spark cluster looks down

When I ran tcpdump on the docker0 interface, it looked like the workers and the master were talking to each other.

However, the spark console works.

If I create the context as val sc = new SparkContext("local", "Generic", System.getenv("SPARK_HOME")), the program runs.

Asked May 28 '14 by Peter Klipfel


1 Answer

I've been there. It looks like the Akka actor subsystem in Spark is binding to a different interface than docker0.

While your master is at: spark://172.17.0.3:7077

Akka is binding to: akka.tcp://spark@10.0.3.1:48085

If your masters/slaves are Docker containers, they should be communicating through the docker0 interface in the 172.17.x.x range.

Try providing the master and slaves with their correct local IP using the environment variable SPARK_LOCAL_IP. See the configuration docs for details.
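For example, something along these lines (a rough sketch: 172.17.0.3 is your master's address from the question, while 172.17.0.4 is just a placeholder for one of the worker containers):

# on the master container, before starting the master daemon
export SPARK_LOCAL_IP=172.17.0.3

# on each worker container, before starting its worker
export SPARK_LOCAL_IP=172.17.0.4   # that container's own 172.17.x.x address

You can also put SPARK_LOCAL_IP in conf/spark-env.sh so the launch scripts pick it up automatically.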

In our Docker setup for Spark 0.9, we use this command to start the slaves:

${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_IP -i $LOCAL_IP 

This directly provides the local IP to the worker.
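For instance, a worker pointed at your master would be started with something like this (172.17.0.4 is again just a placeholder for the worker container's own address):

${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker spark://172.17.0.3:7077 -i 172.17.0.4

With both the master URL and the -i address on the 172.17.x.x docker0 network, Akka advertises an address that the other containers can actually reach.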

Answered Oct 11 '22 by maasg