I tried to build a Spark cluster using a local Ubuntu virtual machine as the master and a remote Ubuntu virtual machine as the worker. Since the local virtual machine runs inside VirtualBox, I forwarded the VM's port 7077 to the host's port 7077 so the remote machine can reach it. I start the master with:
./sbin/start-master.sh -h 0.0.0.0 -p 7077
I made it listen on 0.0.0.0, because with the default 127.0.1.1 the remote machine is not able to connect to it.
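For reference, the port forwarding itself was set up with a VirtualBox NAT rule roughly like the following (the VM name "spark-master" is just a placeholder for my actual VM name):
VBoxManage modifyvm "spark-master" --natpf1 "spark,tcp,,7077,,7077"
This forwards TCP port 7077 on the host to port 7077 inside the guest.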
I start the worker by executing the following command on the remote machine:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://129.22.151.82:7077
The worker is able to connect to the master, as can be seen in the master's web UI.
Then I tried to run the "pi" example python code:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("Pi").setMaster("spark://0.0.0.0:7077")
sc = SparkContext(conf=conf)
....
Once I run it, the program never finishes: it keeps removing and adding executors, because the executors always exit with error code 1. This is the executor's stderr:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/02/25 13:22:22 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
16/02/25 13:22:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/25 13:22:23 INFO SecurityManager: Changing view acls to: kxz138,adminuser
16/02/25 13:22:23 INFO SecurityManager: Changing modify acls to: kxz138,adminuser
16/02/25 13:22:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(kxz138, adminuser); users with modify permissions: Set(kxz138, adminuser)
**16/02/25 13:22:23 ERROR UserGroupInformation: PriviledgedActionException as:adminuser (auth:SIMPLE) cause:java.io.IOException: Failed to connect to /10.0.2.15:34935
Exception in thread "main" java.io.IOException: Failed to connect to /10.0.2.15:34935**
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:200)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
I noticed the error here is actually a network problem. The worker is trying to reach 10.0.2.15, which is the local NAT IP address of my virtual machine, and failing.
This error never occurs when I deploy the worker only on my local computer.
Does anyone have any idea why this error occurs? Why is the worker trying to reach the IP address 10.0.2.15 instead of my public IP?
BTW, I've already set up passwordless SSH access from the master to the worker.
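For what it's worth, the address the executors try to connect back to appears to be the driver's advertised address, which Spark picks up from the local network interface. A possible workaround (untested on my setup; 129.22.151.82 is just the host's public IP from above, and pi.py stands for the script shown earlier) would be to pin that address via SPARK_LOCAL_IP or spark.driver.host:
export SPARK_LOCAL_IP=129.22.151.82
./bin/spark-submit --conf spark.driver.host=129.22.151.82 pi.py
Whether this helps still depends on the workers being able to route packets back to that address.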
I solved the problem by making sure that the VMs within the cluster belong to the same subnet. Initially, I had set 192.168.56.101 as the master node and 192.168.57.101 as the worker node, with 255.255.255.0 as the subnet mask. But this means that the two IP addresses are not within the same subnet. After I changed the subnet mask to e.g. 255.255.0.0, I could run my app properly. You may need to edit some configuration files accordingly as well (e.g. ~/.bashrc, conf/spark-env.sh, conf/slaves.sh, conf/spark-default.conf and /etc/hosts); a sketch of those edits follows.
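For illustration only, the edits looked roughly like this (the hostnames spark-master/spark-worker1 and the addresses are placeholders, and exact variable names can differ between Spark versions):
# /etc/hosts on every node
192.168.56.101   spark-master
192.168.57.101   spark-worker1
# conf/spark-env.sh on the master (Spark 1.x naming)
SPARK_MASTER_IP=192.168.56.101
# conf/spark-env.sh on each worker: that node's own address
SPARK_LOCAL_IP=192.168.57.101
# conf/slaves on the master: one worker hostname per line
spark-worker1
The point is simply that every node resolves every other node to an address that is directly routable within the shared subnet.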