spark over kubernetes vs yarn/hadoop ecosystem [closed]

I see a lot of traction for Spark on Kubernetes. Is it better than running Spark on Hadoop? Both approaches run in a distributed fashion. Can someone help me understand the differences between running Spark on Kubernetes and running it on the Hadoop ecosystem?

Thanks

Premchand asked Jun 26 '18 04:06



2 Answers

Can someone help me understand the difference/comparison between running spark on kubernetes vs Hadoop ecosystem?

Be forewarned this is a theoretical answer, because I don't run Spark anymore, and thus I haven't run Spark on kubernetes, but I have maintained both a Hadoop cluster and now a kubernetes cluster, and so I can speak to some of their differences.

Kubernetes is about as battle-hardened a resource manager, with API access to all of its components, as a reasonable person could wish for. It provides painless declarative resource limits (CPU and RAM, plus even syscall capacities), very painless log egress (both back to the user via kubectl and out of the cluster through several flavors of log management), and an unprecedented level of metrics gathering and egress, which lets one keep an eye on the health of the cluster and the jobs running in it. The list goes on and on.
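As a rough illustration of how those declarative limits look from the Spark side, here is a minimal sketch that assembles a `spark-submit` command for a Kubernetes master. The configuration keys are the ones documented for Spark on Kubernetes since 2.3.0. The API server URL, image name, jar path, namespace, and the helper `build_spark_submit` itself are hypothetical placeholders, not something taken from this thread.

```python
from typing import Dict, List


def build_spark_submit(api_server: str, image: str, app_jar: str,
                       main_class: str, namespace: str = "spark-jobs",
                       executors: int = 2) -> List[str]:
    """Assemble a spark-submit argument list targeting a Kubernetes master."""
    # Resource settings are plain key/value configuration, not code.
    conf: Dict[str, str] = {
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
        "spark.executor.memory": "2g",
        # CPU limits applied to the driver and executor pods.
        "spark.kubernetes.driver.limit.cores": "1",
        "spark.kubernetes.executor.limit.cores": "1",
    }
    cmd = [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # the k8s:// prefix selects Kubernetes
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)
    return cmd


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    print(" ".join(build_spark_submit(
        "https://kubernetes.example.com:6443",
        "registry.example.com/spark:2.3.0",
        "local:///opt/spark/jars/app.jar",
        "org.example.Main",
    )))
```

The point of the sketch is only that the limits travel with the job submission. Whether these particular values suit a given workload is a separate question.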

But perhaps the biggest reason one would choose to run Spark on Kubernetes is the same reason one would choose to run Kubernetes at all: shared resources, rather than having to create new machines for different workloads (well, plus all of the benefits above). So if you have a dedicated Spark cluster, it is very likely to burn $$$ while no job is actively running on it, whereas Kubernetes will cheerfully schedule other jobs onto those Nodes while they aren't running Spark jobs. Yes, I am aware that Mesos and YARN are "generic" cluster resource managers, but it has not been my experience that they are as painless or ubiquitous as Kubernetes.

I would welcome someone posting the counter narrative, or contributing more hands-on experience of Spark on kubernetes, but tho

mdaniel answered Sep 18 '22 15:09


To complete Matthew L Daniel's opinion, mine focuses on two interesting concepts that Kubernetes can bring to data pipelines:
- namespaces + resource quotas make it easier to separate and share resources, for instance by reserving more resources for the data-intensive, more unpredictable, or business-critical parts without necessarily adding a new node every time
- horizontal scaling: when the Kubernetes scheduler cannot place new pods, such as those that Spark's dynamic resource allocation may create in the future (not implemented yet), the cluster can add the necessary nodes dynamically (e.g. through https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#introduction).

That said, horizontal scaling is currently difficult to achieve in Apache Spark, since it requires keeping the external shuffle service even for an executor that has been shut down. So even if the load decreases, we still keep the nodes that were created to handle its increase. But once this problem is solved, Kubernetes autoscaling will be an interesting option to reduce costs, improve processing performance, and make pipelines elastic.
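To make the namespaces + resource quotas idea a bit more concrete, here is a minimal sketch that builds a Kubernetes `ResourceQuota` manifest for a namespace dedicated to Spark jobs. The `ResourceQuota` kind and the `spec.hard` keys are standard Kubernetes API. The namespace name, the numbers, and the helper `resource_quota` are assumptions made up for illustration.

```python
import json
from typing import Any, Dict


def resource_quota(namespace: str, cpu: str, memory: str,
                   pods: int) -> Dict[str, Any]:
    """Build a ResourceQuota manifest capping a namespace's total usage."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                # Sum of requests and limits across all pods in the namespace.
                "requests.cpu": cpu,
                "requests.memory": memory,
                "limits.cpu": cpu,
                "limits.memory": memory,
                # Maximum number of pods (driver + executors + anything else).
                "pods": str(pods),
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical values: a namespace for business-critical Spark pipelines.
    manifest = resource_quota("spark-critical", cpu="16", memory="64Gi", pods=20)
    # kubectl accepts JSON as well as YAML, e.g. `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

One thing to keep in mind with such a quota: once a namespace has a quota on CPU or memory limits, pods that do not declare the corresponding limits may be rejected, so the Spark job would need to set them (or the namespace would need defaults).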

However, please note that all of these statements are based only on personal observations and some local tests of the early Spark on Kubernetes feature (2.3.0).

Bartosz Konieczny answered Sep 20 '22 15:09