I have been developing in PySpark with Spark in standalone, non-cluster mode. Now I would like to explore Spark's cluster mode. I searched the internet and found that I may need a cluster manager, such as Apache Mesos or Spark Standalone, to run clusters across different machines, but I couldn't easily find the details of the overall picture.
How should I set things up, from a system design point of view, in order to run Spark clusters across multiple Windows machines (or multiple Windows VMs)?
Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
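As an illustration, in PySpark that connection is made when you create the SparkSession (and its SparkContext) with the cluster's master URL. Here is a minimal sketch, assuming a standalone master is reachable at spark://localhost:7077:

# Minimal PySpark sketch: the driver connects to a standalone master
# (assumed to be at spark://localhost:7077); Spark then acquires executors
# on the workers and ships the tasks below to them.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://localhost:7077")
         .appName("cluster-mode-demo")
         .getOrCreate())

# A trivial distributed computation executed on the executors.
rdd = spark.sparkContext.parallelize(range(1000))
print(rdd.map(lambda x: x * x).sum())

spark.stop()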
To launch a Spark standalone cluster with the launch scripts, you should create a file called conf/workers in your Spark directory, which must contain the hostnames of all the machines where you intend to start Spark workers, one per line.
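For example, a conf/workers file for three machines could look like this (the hostnames are placeholders for your own machines):

# conf/workers -- one Spark worker host per line
worker-host-1
worker-host-2
worker-host-3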
You may want to explore (from the simplest) Spark Standalone, through Hadoop YARN to Apache Mesos or DC/OS. See Cluster Mode Overview.
I'd recommend using Spark Standalone first (as the easiest option to submit Spark applications to). Spark Standalone is included in any Spark installation and works fine on Windows. The issue is that there are no scripts to start and stop the standalone Master and Workers (aka slaves) for Windows OS. You simply have to "code" them yourself.
Use the following to start a standalone Master on Windows:
rem terminal 1
bin\spark-class org.apache.spark.deploy.master.Master
Please note that after you start the standalone Master you get no output, but don't worry; head over to http://localhost:8080/ to see the web UI of the Spark Standalone cluster.
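If the Master runs on a different machine than the Workers, you may want to bind it to an address the other machines can reach. As a sketch (the IP address is just a placeholder for the master machine's address):

rem terminal 1 (on the master machine; 192.168.1.10 is a placeholder)
bin\spark-class org.apache.spark.deploy.master.Master --host 192.168.1.10 --port 7077

Workers on other machines would then connect to spark://192.168.1.10:7077 instead of spark://localhost:7077.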
In a separate terminal, start an instance of the standalone Worker:
rem terminal 2
bin\spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077
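To add more machines to the cluster, run the same Worker command on every Windows box that should act as a Worker, pointing it at the Master's URL (shown at the top of the Master's web UI). Again, the address below is a placeholder:

rem on each additional worker machine
bin\spark-class org.apache.spark.deploy.worker.Worker spark://192.168.1.10:7077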
With a one-worker Spark Standalone cluster up, you should be able to submit Spark applications as follows:
spark-submit --master spark://localhost:7077 ...
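For example, submitting a PySpark application could look like this (my_app.py is a hypothetical script of yours):

spark-submit --master spark://localhost:7077 my_app.py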
Read Spark Standalone Mode in the official documentation of Spark.
As I just found out, Mesos is not an option given its System Requirements:
Mesos runs on Linux (64 Bit) and Mac OS X (64 Bit).
You could, however, run any of these clusters in virtual machines using VirtualBox or similar. At least DC/OS has dcos-vagrant, which should make it fairly easy:
dcos-vagrant: Quickly provision a DC/OS cluster on a local machine for development, testing, or demonstration.
Deploying DC/OS Vagrant involves creating a local cluster of VirtualBox VMs using the dcos-vagrant-box base image and then installing DC/OS.
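Very roughly, and assuming the standard workflow from the dcos-vagrant README, that boils down to cloning the repository and bringing the VMs up (the README describes additional configuration, such as a VagrantConfig file, that may be required first):

git clone https://github.com/dcos/dcos-vagrant
cd dcos-vagrant
vagrant up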