 

Basic Spark example not working

Tags:

apache-spark

I'm learning Spark and wanted to run the simplest possible cluster consisting of two physical machines. I've done all the basic setup and it seems to be fine. The output of the automatic start script looks as follows:

[username@localhost sbin]$ ./start-all.sh 
starting org.apache.spark.deploy.master.Master, logging to /home/username/spark-1.6.0-bin-hadoop2.6/logs/spark-username-org.apache.spark.deploy.master.Master-1-localhost.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /home/username/spark-1.6.0-bin-hadoop2.6/logs/spark-username-org.apache.spark.deploy.worker.Worker-1-localhost.out
username@192.168.???.??: starting org.apache.spark.deploy.worker.Worker, logging to /home/username/spark-1.6.0-bin-hadoop2.6/logs/spark-username-org.apache.spark.deploy.worker.Worker-1-localhost.localdomain.out

So there are no errors here, and it seems that a Master node is running as well as two Worker nodes. However, when I open the web UI at 192.168.???.??:8080, it only lists one worker - the local one. My issue is similar to the one described in "Spark Clusters: worker info doesn't show on web UI", but there's nothing unusual in my /etc/hosts file. All it contains is:

127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6 

What am I missing? Both machines are running Fedora Workstation x86_64.

Asked Feb 16 '16 by Krzysiek Setlak

People also ask

What is SparkPi?

SparkPi is deployed as a single server pod and several Apache Spark pods. The microservice provides an HTTP server that accepts GET requests and responds with an estimation of Pi, which it calculates with Apache Spark using a Monte Carlo method.
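
For reference, a SparkPi example also ships with the standard Spark download and can be run from the installation directory; this is a minimal sketch assuming the bundled run-example script:

# Run the bundled SparkPi example with 100 sampling partitions
./bin/run-example SparkPi 100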

How do I run Spark submit in cluster mode?

You can submit a Spark batch application in cluster mode (the default) or client mode, either from inside the cluster or from an external client. In cluster mode the driver runs on a host in your driver resource group, and the spark-submit syntax is --deploy-mode cluster.
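
As a hedged illustration (the master URL, class name, and jar path below are placeholders, not taken from the question), a cluster-mode submission against a standalone master might look like this:

# Driver runs inside the cluster rather than on the submitting machine
./bin/spark-submit \
  --master spark://YOUR_SPARK_MASTER_IP:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/my-app.jar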

How do I run Pyspark in terminal?

Go to the Spark installation directory on the command line, type bin/pyspark, and press Enter; this launches the PySpark shell and gives you a prompt for interacting with Spark in Python. If you have added Spark to your PATH, just enter pyspark in the command line or terminal (including on macOS).
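
Using the installation path from the question above, that would look roughly like this:

cd /home/username/spark-1.6.0-bin-hadoop2.6
./bin/pyspark        # launches the interactive PySpark shell

# or, if Spark's bin directory is on your PATH:
pyspark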


1 Answer

It seems that Spark is quite picky about IPs and machine names. When you start your master, it registers itself under your machine's hostname. If that name is not resolvable from your workers, they will have practically no way to reach it.
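
One common fix, offered here only as a hedged suggestion and not part of the original answer, is to make both machines resolvable by name by adding entries to /etc/hosts on each host (the addresses and hostnames below are placeholders):

# /etc/hosts on both machines (placeholder values)
192.168.1.10   spark-master
192.168.1.11   spark-worker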

Alternatively, a workaround is to start your master with an explicit IP, like this:

SPARK_MASTER_IP=YOUR_SPARK_MASTER_IP ${SPARK_HOME}/sbin/start-master.sh
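
If you prefer not to pass the variable on every start, the same setting can be made persistent in conf/spark-env.sh; this is a sketch for Spark 1.x, where the variable is SPARK_MASTER_IP (newer releases use SPARK_MASTER_HOST instead):

# ${SPARK_HOME}/conf/spark-env.sh
SPARK_MASTER_IP=YOUR_SPARK_MASTER_IP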

Then you will be able to connect your slaves (workers) to it like this:

${SPARK_HOME}/sbin/start-slave.sh spark://YOUR_SPARK_MASTER_IP:PORT
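
Putting the two steps together with the default standalone master port (7077) and a placeholder address, the full sequence might look like this:

# on the master machine
SPARK_MASTER_IP=192.168.1.10 ${SPARK_HOME}/sbin/start-master.sh

# on each worker machine
${SPARK_HOME}/sbin/start-slave.sh spark://192.168.1.10:7077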

And there you go!

Answered Oct 02 '22 by dsncode