 

Spark worker keeps removing and adding executors

I tried to build a Spark cluster using a local Ubuntu virtual machine as the master and a remote Ubuntu virtual machine as the worker. Since the local virtual machine runs in VirtualBox, I forwarded the VM's port 7077 to the host's port 7077 to make it reachable from the remote guest. I start the master with:

./sbin/start-master.sh -h 0.0.0.0 -p 7077

I made it listen on 0.0.0.0 because, with the default 127.0.1.1, the remote guest wouldn't be able to connect to it. I start the worker by executing the following command on the remote machine:

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://129.22.151.82:7077

The worker is able to connect to the master, as can be seen in the web UI (screenshot).

Then I tried to run the "Pi" example Python code:

from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Pi").setMaster("spark://0.0.0.0:7077")
sc = SparkContext(conf=conf)
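
The rest of the script is the standard Monte Carlo estimate from Spark's Pi example; a rough sketch of that part (the names below are illustrative, not my exact code):

    import random

    # Sample random points in the unit square and count those that fall
    # inside the quarter circle; the ratio approximates pi/4.
    n = 100000

    def inside(_):
        x, y = random.random(), random.random()
        return x * x + y * y < 1

    count = sc.parallelize(range(n)).filter(inside).count()
    print("Pi is roughly %f" % (4.0 * count / n))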

Once I run it, the program never stops: it just keeps removing and adding executors, because the executors always exit with error code 1. This is an executor's stderr:

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    16/02/25 13:22:22 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
    16/02/25 13:22:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/02/25 13:22:23 INFO SecurityManager: Changing view acls to: kxz138,adminuser
    16/02/25 13:22:23 INFO SecurityManager: Changing modify acls to: kxz138,adminuser
    16/02/25 13:22:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kxz138, adminuser); users with modify permissions: Set(kxz138, adminuser)
    16/02/25 13:22:23 ERROR UserGroupInformation: PriviledgedActionException as:adminuser (auth:SIMPLE) cause:java.io.IOException: Failed to connect to /10.0.2.15:34935
    Exception in thread "main" java.io.IOException: Failed to connect to /10.0.2.15:34935
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
        at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
        at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)

I noticed the error here is actually a network problem. The worker is trying to reach 10.0.2.15, which is the local NAT IP address of my virtual machine, and failing. This error never occurs when I deploy the worker only on my local computer. Does anyone have an idea why this error occurs? Why is the worker trying to access the IP address 10.0.2.15 instead of my public IP?
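
One thing worth noting: Spark lets the driver advertise an explicit address to executors via the spark.driver.host property. A sketch of what that would look like, assuming 129.22.151.82 (the master's address from above) is also reachable for the driver's ports; this is a possible workaround, not something confirmed by this thread:

    from pyspark import SparkConf, SparkContext

    # Sketch: pin the address the driver advertises to executors, so they
    # connect back over a reachable address instead of the VM's NAT
    # address (10.0.2.15). The IP below is assumed from the setup above.
    conf = (SparkConf()
            .setAppName("Pi")
            .setMaster("spark://129.22.151.82:7077")
            .set("spark.driver.host", "129.22.151.82"))
    sc = SparkContext(conf=conf)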

BTW, I've already set up passwordless SSH access from the master to the slave.

asked Nov 20 '22 by Seymour Zhang

1 Answer

I solved the problem by making sure that the VMs in the cluster belong to the same subnetwork. Initially I had given the master node the IP 192.168.56.101 and the worker node 192.168.57.101, both with the subnet mask 255.255.255.0, which puts the two addresses in different subnetworks (192.168.56.0/24 vs. 192.168.57.0/24). After I changed the subnet mask to, e.g., 255.255.0.0, I could run my app properly. You may also need to edit some configuration files accordingly (e.g., ~/.bashrc, conf/spark-env.sh, conf/slaves.sh, conf/spark-defaults.conf and /etc/hosts).
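
As a quick sanity check (not part of the original answer), Python's ipaddress module can confirm whether two addresses share a subnetwork under a given mask:

    import ipaddress

    # The addresses and masks are the ones from the answer above.
    a = ipaddress.ip_interface("192.168.56.101/255.255.255.0")
    b = ipaddress.ip_interface("192.168.57.101/255.255.255.0")
    print(a.network == b.network)  # False: 192.168.56.0/24 vs 192.168.57.0/24

    a = ipaddress.ip_interface("192.168.56.101/255.255.0.0")
    b = ipaddress.ip_interface("192.168.57.101/255.255.0.0")
    print(a.network == b.network)  # True: both fall in 192.168.0.0/16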

answered Nov 23 '22 by fatarms