I am new to Apache Spark, and I just learned that Spark supports three types of cluster: Standalone, Mesos, and YARN.
I think I should try Standalone first. In the future, I need to build a large cluster (hundreds of instances).
Which cluster type should I choose?
Apache Spark has four main open source cluster managers: Mesos, YARN, Standalone, and Kubernetes. Each cluster manager has its own requirements and trade-offs. Supporting the scheduling engine in IBM Spectrum Conductor, for example, required modifications to some core pieces of Spark.
In cluster mode, the Spark driver (or the Spark application master) is started on one of the worker machines. The client can therefore submit the application and then either disconnect or carry on with other work.
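As a rough illustration of cluster deploy mode, the Python sketch below assembles a spark-submit command line for each cluster manager. The host names, ports, the class name `com.example.MyApp`, the jar name, and the helper `build_submit_command` are placeholders made up for this example; `--master`, `--deploy-mode`, and `--class` are standard spark-submit options.

```python
# Master URL formats accepted by spark-submit; hosts and ports are placeholders.
MASTER_URLS = {
    "standalone": "spark://master-host:7077",
    "yarn": "yarn",
    "mesos": "mesos://mesos-host:5050",
    "kubernetes": "k8s://https://k8s-host:6443",
}

def build_submit_command(manager, app_jar, main_class, deploy_mode="cluster"):
    """Return the spark-submit argument list for the given cluster manager."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", MASTER_URLS[manager],
        # "cluster" starts the driver inside the cluster, so the client may exit.
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        app_jar,
    ]

cmd = build_submit_command("yarn", "my-app.jar", "com.example.MyApp")
print(" ".join(cmd))
```

Switching `deploy_mode` to `"client"` keeps the driver on the submitting machine instead, which is usually what you want for interactive work.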
Spark Standalone Manager: a simple cluster manager included with Spark that makes it easy to set up a cluster. By default, each application uses all the available nodes in the cluster.
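Because a standalone application takes everything available by default, it is common to cap it. Here is a minimal sketch, assuming the standard properties `spark.cores.max` (standalone) and `spark.executor.instances` (YARN), that renders them as spark-defaults.conf-style lines. The helper names `executor_settings` and `render` and the numbers are made up for illustration.

```python
def executor_settings(manager, executors, cores_per_executor, memory="4g"):
    """Return Spark properties that bound an application's resources."""
    conf = {
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": memory,
    }
    if manager == "standalone":
        # In this sketch the total core count is capped rather than
        # setting an executor count directly.
        conf["spark.cores.max"] = str(executors * cores_per_executor)
    elif manager == "yarn":
        # On YARN the number of executors is requested explicitly.
        conf["spark.executor.instances"] = str(executors)
    else:
        raise ValueError(f"unsupported manager: {manager}")
    return conf

def render(conf):
    """Format properties as 'key value' lines, one per property."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))

standalone_conf = executor_settings("standalone", executors=10, cores_per_executor=4)
print(render(standalone_conf))
```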
A few benefits of YARN over Standalone & Mesos:
YARN allows you to dynamically share and centrally configure the same pool of cluster resources between all frameworks that run on YARN.
You can take advantage of all the features of YARN schedulers for categorizing, isolating, and prioritizing workloads.
In Spark standalone mode, each application by default runs an executor on every node in the cluster, whereas with YARN you choose the number of executors to use.
YARN directly handles rack and machine locality in your requests, which is convenient.
The resource request model is, oddly, backwards in Mesos. In YARN, you (the framework) request containers with a given specification and give locality preferences. In Mesos, you get resource "offers" and choose to accept or reject them based on your own scheduling policy. The Mesos model is arguably more flexible, but seemingly more work for the person implementing the framework.
If you already have a big Hadoop cluster in place, YARN is the better choice.
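The difference between the YARN request model and the Mesos offer model mentioned above can be sketched with a toy simulation. This is not the YARN or Mesos API: `Node`, `request_model`, and `offer_model` are hypothetical names, and the scheduling logic is deliberately simplistic.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cores: int

def request_model(nodes, cores_needed, preferred):
    """YARN-like: the framework states its needs; the manager picks a node."""
    # Preferred nodes sort first (False < True); the sort is stable otherwise.
    ordered = sorted(nodes, key=lambda n: n.name not in preferred)
    for node in ordered:
        if node.free_cores >= cores_needed:
            node.free_cores -= cores_needed
            return node.name
    return None

def offer_model(nodes, accept):
    """Mesos-like: the manager offers resources; the framework's policy decides."""
    for node in nodes:
        take = accept(node.name, node.free_cores)  # 0 means "reject the offer"
        if take:
            node.free_cores -= take
            return node.name
    return None

nodes = [Node("a", 2), Node("b", 8), Node("c", 8)]
granted = request_model(nodes, cores_needed=4, preferred={"c"})
# Framework-side policy: take 4 cores from any offer with at least 4 free.
accepted = offer_model(nodes, lambda name, free: 4 if free >= 4 else 0)
print(granted, accepted)
```

In the request model the placement logic lives in the manager, while in the offer model the `accept` callback is code the framework author has to write. That is the extra work the point above refers to.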
The Standalone manager requires the user to configure each of the nodes with the shared secret. Mesos' default authentication module, Cyrus SASL, can be replaced with a custom module. YARN has security for authentication, service-level authorization, authentication for Web consoles, and data confidentiality. Hadoop uses Kerberos to authenticate each user and service.
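To make the authentication settings concrete, here is a minimal sketch, assuming the standard Spark properties `spark.authenticate` and `spark.authenticate.secret` for the shared secret and the spark-submit flags `--principal` and `--keytab` for Kerberos on YARN. The helper names, the secret, the principal, and the keytab path are placeholders invented for this example.

```python
def standalone_auth_conf(secret):
    """Properties for shared-secret authentication; use the same values on every node."""
    if not secret:
        raise ValueError("a non-empty shared secret is required")
    return {
        "spark.authenticate": "true",
        "spark.authenticate.secret": secret,
    }

def yarn_kerberos_args(principal, keytab):
    """spark-submit flags for logging in to a Kerberos-secured YARN cluster."""
    return ["--principal", principal, "--keytab", keytab]

auth_conf = standalone_auth_conf("change-me")
kerberos_args = yarn_kerberos_args("alice@EXAMPLE.COM", "/path/to/alice.keytab")
print(auth_conf)
print(kerberos_args)
```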
Useful links:
spark documentation page
agildata article
I think the people best placed to answer this are those who work on Spark. So, from Learning Spark:
Start with a standalone cluster if this is a new deployment. Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.
If you would like to run Spark alongside other applications, or to use richer resource scheduling capabilities (e.g. queues), both YARN and Mesos provide these features. Of these, YARN will likely be preinstalled in many Hadoop distributions.
One advantage of Mesos over both YARN and standalone mode is its fine-grained sharing option, which lets interactive applications such as the Spark shell scale down their CPU allocation between commands. This makes it attractive in environments where multiple users are running interactive shells.
In all cases, it is best to run Spark on the same nodes as HDFS for fast access to storage. You can install Mesos or the standalone cluster manager on the same nodes manually, or most Hadoop distributions already install YARN and HDFS together.