On a 3-node Spark/Hadoop cluster, which scheduler (cluster manager) will work most efficiently? Currently I am using the Standalone manager, but for each Spark job I have to explicitly specify all resource parameters (e.g. cores, memory), which I want to avoid. I have tried YARN as well, but it runs 10x slower than the Standalone manager.
Would Mesos be helpful?
Cluster Details: Spark 1.2.1 and Hadoop 2.7.1
Spark standalone mode requires each application to run an executor on every node in the cluster, whereas with YARN, you choose the number of executors to use.
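As a rough illustration of that difference, here is a minimal Python sketch that assembles a `spark-submit` command line. The helper name `build_submit_args`, the host name `master-host` and the file `job.py` are made up for the example; the flags themselves (`--num-executors` for YARN, `--total-executor-cores` for standalone) are the ones listed by `spark-submit --help`.

```python
def build_submit_args(app_path, master, executor_memory,
                      num_executors=None, total_executor_cores=None):
    """Build a spark-submit argument list for the given master URL.

    Hypothetical helper: it only assembles strings, it does not launch Spark.
    """
    args = ["spark-submit", "--master", master,
            "--executor-memory", executor_memory]
    if master.startswith("yarn"):
        # On YARN you pick how many executors the application gets.
        if num_executors is not None:
            args += ["--num-executors", str(num_executors)]
    elif master.startswith("spark://"):
        # On standalone you cap the total cores instead of counting executors.
        if total_executor_cores is not None:
            args += ["--total-executor-cores", str(total_executor_cores)]
    args.append(app_path)
    return args


# Example values are placeholders, not recommendations.
yarn_args = build_submit_args("job.py", "yarn-cluster", "2g",
                              num_executors=3, total_executor_cores=6)
standalone_args = build_submit_args("job.py", "spark://master-host:7077", "2g",
                                    num_executors=3, total_executor_cores=6)
print(" ".join(yarn_args))
print(" ".join(standalone_args))
```

The same inputs produce different flags depending on the master URL, which is the practical side of "you choose the number of executors" on YARN.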
Apache Spark supports three cluster managers: the Standalone cluster manager, Hadoop YARN, and Apache Mesos. We will also describe how the Spark cluster managers work in this document. In closing, we will compare Spark Standalone vs YARN vs Mesos.
Apache Spark runs in the following modes: local, Standalone, YARN, Mesos, Kubernetes, and Nomad.
Local mode runs Spark applications on a single machine, directly on the operating system. This mode is useful for Spark application development and testing.
Standalone, YARN, Mesos, and Kubernetes are distributed modes. In a distributed environment, managing computing resources is very important, so to use them efficiently we need a good resource management system, or resource scheduler.
Standalone is good for small Spark clusters, but not for bigger ones: there is an overhead of running the Spark daemons (master and slave) on the cluster nodes, and these daemons require dedicated resources. So Standalone is not recommended for bigger production clusters. Standalone supports only Spark applications; it is not a general-purpose cluster manager. In an enterprise context, where we have a variety of workloads to run, the Spark Standalone cluster manager is not a good choice.
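On a small Standalone cluster like the three-node one in the question, one way to avoid repeating resource parameters for every job is to put them once in `conf/spark-defaults.conf`, which `spark-submit` reads. The sketch below just renders such a file from a Python dict. The function name `render_spark_defaults`, the host `master-host` and the values are illustrative assumptions; the property names (`spark.master`, `spark.executor.memory`, `spark.cores.max`) are standard Spark configuration keys.

```python
def render_spark_defaults(settings):
    """Render settings as spark-defaults.conf text: one 'key value' pair per line."""
    lines = [f"{key} {value}" for key, value in sorted(settings.items())]
    return "\n".join(lines) + "\n"


# Placeholder values for a small cluster; tune them for your own nodes.
defaults = {
    "spark.master": "spark://master-host:7077",
    "spark.executor.memory": "4g",
    "spark.cores.max": "6",
}

conf_text = render_spark_defaults(defaults)
print(conf_text)
```

Anything passed explicitly on the `spark-submit` command line still overrides these defaults, so individual jobs can ask for more or less when they need to.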
In YARN and Mesos modes, Spark runs as an application and there is no overhead from Spark daemons. So we can use either YARN or Mesos for better performance and scalability. Both YARN and Mesos are general-purpose distributed resource managers, and they support a variety of workloads such as MapReduce, Spark, Flink, and Storm, with container orchestration. They are good for running large-scale enterprise production clusters.
Between YARN and Mesos, YARN is designed specifically for Hadoop workloads, whereas Mesos is designed for all kinds of workloads. YARN is an application-level scheduler and Mesos is an OS-level scheduler. It is better to use YARN if you already have a running Hadoop cluster (Apache/CDH/HDP). For a brand new project, it is better to use Mesos (Apache, Mesosphere). There is also a way to use both of them colocated, through a project called Apache Myriad.
Kubernetes - An open source system for automating deployment, scaling, and management of containerized applications, so it is used for running Spark applications in containers. Most cloud vendors, such as Google, Microsoft, and Amazon, offer Kubernetes as a managed service. We can also have an on-prem K8s cluster to run Spark applications in containers. Here the containers are Docker or cgroups/Linux containers.
Nomad - Another open source system for running Spark applications. It is not officially supported by the Spark project as a cluster manager.
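To make the modes above a little more concrete, here is a small Python sketch of the master URL each officially supported mode uses. The helper `master_url` and the host names are hypothetical. The URL schemes are the ones in the Spark documentation for recent releases; older releases such as the 1.2.1 in the question use `yarn-client` or `yarn-cluster` instead of `yarn`, and have no Kubernetes support. Nomad is left out because it is not an official cluster manager.

```python
# Master URL patterns per mode (Nomad omitted: not officially supported).
MASTER_URL_FORMATS = {
    "local": "local[{cores}]",
    "standalone": "spark://{host}:{port}",
    "mesos": "mesos://{host}:{port}",
    "yarn": "yarn",
    "kubernetes": "k8s://https://{host}:{port}",
}


def master_url(mode, host=None, port=None, cores="*"):
    """Return the value to pass to --master for the given mode."""
    if mode not in MASTER_URL_FORMATS:
        raise ValueError(f"unknown mode: {mode}")
    needs_host = "{host}" in MASTER_URL_FORMATS[mode]
    if needs_host and (host is None or port is None):
        raise ValueError(f"mode {mode!r} needs a host and a port")
    return MASTER_URL_FORMATS[mode].format(host=host, port=port, cores=cores)


# Host names and ports below are placeholders.
print(master_url("local"))
print(master_url("standalone", "master-host", 7077))
print(master_url("mesos", "mesos-host", 5050))
print(master_url("yarn"))
```

YARN needs no host in the URL because Spark finds the ResourceManager through the Hadoop configuration directory.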
Of all the modes above, Apache Mesos has the better resource management capabilities.
Please see this link; it contains a detailed explanation from experts about YARN vs Mesos. http://www.quora.com/How-does-YARN-compare-to-Mesos