How to set up cluster environment for Spark applications on Windows machines?

I have been developing in PySpark with Spark in standalone, non-cluster mode. Now I would like to explore Spark's cluster mode. I searched the internet and found that I may need a cluster manager, such as Apache Mesos or Spark Standalone, to run a cluster across different machines, but I couldn't easily find a clear picture of the details.

From a system design point of view, how should I set things up in order to run a Spark cluster across multiple Windows machines (or multiple Windows VMs)?

Asked Jun 08 '17 by Yohan Chung


People also ask

How do I run Spark application in cluster mode?

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
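
For example (a minimal sketch; the master URL, application script, and helper module are placeholders), submitting a PySpark application to a standalone cluster could look like this:

// master URL and file names below are hypothetical
spark-submit --master spark://master-host:7077 --py-files helpers.py my_app.py

spark-submit ships my_app.py and the modules listed in --py-files to the cluster, and the executors then run the tasks that SparkContext sends them.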

How do I make a Spark standalone cluster?

To launch a Spark standalone cluster with the launch scripts, you should create a file called conf/workers in your Spark directory, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line.
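
For illustration, a conf/workers file for a hypothetical three-node cluster just lists one hostname per line (the names below are placeholders); note that these launch scripts target Unix-like systems, which is why the answer below starts the daemons by hand on Windows:

worker-host-1
worker-host-2
worker-host-3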


1 Answer

You may want to explore the available cluster managers, from the simplest up: Spark Standalone, then Hadoop YARN, then Apache Mesos or DC/OS. See Cluster Mode Overview.

I'd recommend using Spark Standalone first (as the easiest option to submit Spark applications to). Spark Standalone is included in any Spark installation and works fine on Windows. The issue is that there are no scripts to start and stop the standalone Master and Workers (aka slaves) for Windows OS. You simply have to "code" them yourself.

Use the following to start a standalone Master on Windows:

// terminal 1
bin\spark-class org.apache.spark.deploy.master.Master

Please note that after you start the standalone Master the command does not return control to the terminal, but don't worry; head over to http://localhost:8080/ to see the web UI of the Spark Standalone cluster.
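
If the standard ports clash with something else on your machine, the Master accepts options such as --host, --port and --webui-port; the values below are examples only:

// terminal 1 (optional flags; the values are examples)
bin\spark-class org.apache.spark.deploy.master.Master --host localhost --webui-port 8080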

In a separate terminal, start an instance of the standalone Worker.

// terminal 2
bin\spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077
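
If you want to cap what this Worker offers to the Master, it accepts options such as --cores and --memory (the amounts below are examples only):

// terminal 2 (optional flags; the resource amounts are examples)
bin\spark-class org.apache.spark.deploy.worker.Worker --cores 2 --memory 2g spark://localhost:7077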

With a one-worker Spark Standalone cluster up, you should be able to submit Spark applications as follows:

spark-submit --master spark://localhost:7077 ...
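
For example (the application script and resource amounts are placeholders), a complete submission of a PySpark application to this one-worker cluster might look like:

// my_app.py and the resource amounts are hypothetical
spark-submit --master spark://localhost:7077 --total-executor-cores 2 --executor-memory 1g my_app.py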

Read Spark Standalone Mode in the official documentation of Spark.


As I just found out, Mesos is not an option on Windows given its System Requirements:

Mesos runs on Linux (64 Bit) and Mac OS X (64 Bit).

You could, however, run any of these clusters inside virtual machines using VirtualBox or similar. At least DC/OS has dcos-vagrant, which should make it fairly easy:

dcos-vagrant: Quickly provision a DC/OS cluster on a local machine for development, testing, or demonstration.

Deploying DC/OS Vagrant involves creating a local cluster of VirtualBox VMs using the dcos-vagrant-box base image and then installing DC/OS.

Answered Sep 28 '22 by Jacek Laskowski