Where are the major differences?
Kubernetes is an open source orchestration system for Docker containers. It handles scheduling onto nodes in a compute cluster and actively manages workloads to ensure that their state matches the users declared intentions; Yarn: A new package manager for JavaScript.
There are numerous advantages to running Spark on Kubernetes rather than YARN. Let's look at the key benefits: Package all dependencies along with Spark applications in containers. This avoids dependency issues that are common with Spark.
With Hadoop, you're forced to use Java. Kubernetes allows you to use any programming language you want. Using containerized applications allows you to move to alternative cloud storage.
One of Apache Hadoop's core components, YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.
Kubernetes is developed almost from a clean slate for extending Docker container kernel to become a platform. Kubernetes development has taken bottom up approach. It has good optimization on specifying per container/pod resource requirements, but it lacks a effective global scheduler that can partition resources into logical grouping. Kubernetes design allows multiple schedulers to run in the cluster. Each scheduler manages resources within its own pods. However, Kubernetes cluster can suffer from instability when application demands more resources than physical systems can handle. It work best in infrastructure capacity exceeding application demands. Kubernetes scheduler will attempt to fill up the idle nodes with incoming application requests and terminate low priority and starvation containers to improve resource utilization. Kubernetes containers can integrate with external storage system like S3 to provide resilience to data. Kubernetes framework uses etcd to store cluster data. Etcd cluster nodes and Hadoop Namenode are both single point of failures in Kubernetes or Hadoop platform. Etcd can have more replica than Namenode, hence, from reliability point of view seems to favor Kubernetes in theory. However, Kubernetes security is default open, unless RBAC are defined with fine-grained role binding. Security context is set correctly for pods. If omitted, primary group of the pod will default to root, which can be problematic for system administrators trying to secure the infrastructure.
Apache Hadoop YARN was developed to run isolated java processes to process big data workload then improved to support Docker containers. YARN provides global level resource management like capacity queues for partitioning physical resources into logical units. Each business unit can be assigned with percentage of the cluster resources. Capacity resource sharing system is designed in favor of guarentee resource availability for Enterprise priority instead of squeezing every available physical resources. YARN does score more points in security. There are more security featuers in Kerberos, access control for privileged/non-privileged containers, trusted docker images, and placement policy constraints. Most docker related security are default to close, and system admin needs to manually turn on flags to grant more power to containers. Large enterprises tend to run Hadoop more than Kubernetes because securing the system cost less. There are more distributed SQL engines built on top of YARN, including Hive, Impala, SparkSQL and IBM BigSQL. Database options make YARN an attrative option because the ability to run online transaction processing in containers, and online analytical processing using batch workload. Hadoop Developer toolchains can be overwhelming. Mapreduce, Hive, Pig, Spark and etc, each have its own style of development. The user experience is inconsistent and take a while to learn them all. Kubernetes feels less obstructive by comparison because it only deploys docker containers. With introduction of YARN services to run Docker container workload, YARN can feel less wordy than Kubernetes.
If your plan is to out source IT operations to public cloud, pick Kubernetes. If your plan is to build private/hybrid/multi-clouds, pick Apache YARN.
While this question and answer isn't exactly what you are asking, it does touch on a number of the same points.
Last I saw, Yarn was just a resource sharing mechanism, whereas Kubernetes is an entire platform, encompassing ConfigMaps, declarative environment management, Secret management, Volume Mounts, a super well designed API for interacting with all of those things, Role Based Access Control, and Kubernetes is in wide-spread use, meaning one can very easily find both candidates to hire and tools to buy.
A blog post I found cited a master's thesis that describes some of the fascinating trade-offs between the different scheduler's view of the world. It's a lot of words, so if you're looking for a tl;dr answer, that link may not be it, but if you're looking for actual research on the topic, it seems sound.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With