spark over kubernetes vs yarn/hadoop ecosystem [closed]

I see a lot of traction for Spark on Kubernetes. Is it better than running Spark on Hadoop? Both approaches run in a distributed fashion. Can someone help me understand the differences between running Spark on Kubernetes and running it on the Hadoop ecosystem?

Thanks

205

asked Jun 26 '18 04:06

Premchand

2 Answers

Can someone help me understand the difference/comparison between running Spark on Kubernetes vs the Hadoop ecosystem?

Be forewarned: this is a theoretical answer. I don't run Spark anymore, so I haven't run Spark on Kubernetes, but I have maintained both a Hadoop cluster and now a Kubernetes cluster, so I can speak to some of their differences.

Kubernetes is about as battle-hardened a resource manager, with API access to all its components, as a reasonable person could wish for. It provides very painless declarative resource limits (both CPU and RAM, plus even syscall capacities), very, very painless log egress (both back to the user via kubectl and out of the cluster through several flavors of log management), and an unprecedented level of metrics gathering and egress that lets one keep an eye on the health of the cluster and the jobs in it. The list goes on and on.
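As a rough illustration of the declarative resource limits mentioned above, here is a minimal sketch that assembles a `spark-submit` command for a Kubernetes master. The `spark.kubernetes.*` property names are taken from Spark's "Running on Kubernetes" documentation; `build_submit_command` is a hypothetical helper (not part of Spark), and the API server address, image, namespace, class and jar path are all placeholders.

```python
import shlex


def build_submit_command(api_server, image, namespace, main_class, app_jar,
                         executors=2, executor_memory="2g",
                         executor_cores=1, executor_cpu_limit="1"):
    """Return a spark-submit argument list targeting a Kubernetes cluster.

    Hypothetical helper: every argument value passed in is a placeholder.
    """
    conf = {
        "spark.kubernetes.namespace": namespace,
        "spark.kubernetes.container.image": image,
        "spark.executor.instances": str(executors),
        "spark.executor.memory": executor_memory,
        "spark.executor.cores": str(executor_cores),
        # Hard CPU limit applied to each executor pod.
        "spark.kubernetes.executor.limit.cores": executor_cpu_limit,
    }
    cmd = [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app_jar)
    return cmd


command = build_submit_command(
    api_server="https://kubernetes.example.com:6443",    # placeholder
    image="registry.example.com/spark:latest",           # placeholder
    namespace="data-pipelines",                          # placeholder
    main_class="org.example.Main",                       # placeholder
    app_jar="local:///opt/spark/jars/example-app.jar",   # placeholder
)
# Print a shell-safe version of the command instead of running it.
print(" ".join(shlex.quote(part) for part in command))
```

The sketch only prints the command, so it can be inspected without a cluster; whether these particular limits suit a given job is something to tune per workload.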

But perhaps the biggest reason one would choose to run Spark on Kubernetes is the same reason one would choose to run Kubernetes at all: shared resources rather than having to create new machines for different workloads (well, plus all of those benefits above). So if you have a dedicated Spark cluster, it is very, very likely to burn $$$ while no job is actively running on it, whereas Kubernetes will cheerfully schedule other jobs onto those Nodes while they aren't running Spark jobs. Yes, I am aware that Mesos and YARN are "generic" cluster resource managers, but it has not been my experience that they are as painless or ubiquitous as Kubernetes.
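To make the idle-cost argument concrete, here is a back-of-the-envelope sketch. All numbers (node count, hourly price, busy hours) are made up for illustration, and `idle_cost` is a hypothetical helper, not a measurement of any real cluster.

```python
def idle_cost(node_count, hourly_rate, hours_in_period, busy_hours):
    """Cost of the node-hours during which the nodes do no useful work."""
    idle_hours = hours_in_period - busy_hours
    return node_count * hourly_rate * idle_hours


# Hypothetical setup: 10 nodes at $0.50 per node-hour over a 24-hour day.
# Dedicated Spark cluster: Spark jobs keep the nodes busy 6 hours a day.
dedicated = idle_cost(10, 0.50, 24, busy_hours=6)
# Shared cluster: other workloads fill 14 of the remaining hours.
shared = idle_cost(10, 0.50, 24, busy_hours=20)

print(f"dedicated cluster idle cost per day: ${dedicated:.2f}")
print(f"shared cluster idle cost per day:    ${shared:.2f}")
```

The point is only the shape of the comparison: the more non-Spark work the scheduler can pack onto the same nodes, the fewer paid-for hours sit idle.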

I would welcome someone posting the counter narrative, or contributing more hands-on experience of Spark on kubernetes, but tho

180

answered Sep 18 '22 15:09

mdaniel

To complement Matthew L Daniel's answer, mine focuses on two interesting concepts that Kubernetes can bring to data pipelines:

- Namespaces + resource quotas make it easier to separate and share resources, for instance by reserving many more resources for the data-intensive, less predictable, or business-critical parts without necessarily adding a new node every time.
- Horizontal scaling: when the Kubernetes scheduler fails to allocate new pods (which may be created by Spark's dynamic resource allocation in the future; it is not implemented yet), the cluster is able to add the necessary nodes dynamically (e.g. through https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler#introduction).

That said, horizontal scaling is currently difficult to achieve in Apache Spark, since it requires keeping the external shuffle service available even for an executor that has been shut down. So even if the load decreases, we still keep the nodes that were created to handle its increase. Once this problem is solved, Kubernetes autoscaling will be an interesting option to reduce costs, improve processing performance, and make pipelines elastic.
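The namespaces + resource quotas idea can be sketched as follows. This builds Kubernetes `ResourceQuota` manifests as plain Python dicts and prints them as JSON (which `kubectl` accepts alongside YAML). The namespace names and the CPU, memory and pod figures are hypothetical, and `resource_quota` is an illustrative helper rather than part of any Kubernetes client library.

```python
import json


def resource_quota(name, namespace, cpu, memory, max_pods):
    """Build a Kubernetes ResourceQuota manifest as a plain dict."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hard": {
                # Caps on the sum of resource requests of all pods
                # in the namespace, plus a cap on the pod count.
                "requests.cpu": cpu,
                "requests.memory": memory,
                "pods": str(max_pods),
            }
        },
    }


# Hypothetical split: a large share for business-critical pipelines,
# a small one for ad hoc jobs running on the same cluster.
quotas = [
    resource_quota("critical-quota", "pipelines-critical", "40", "160Gi", 100),
    resource_quota("adhoc-quota", "pipelines-adhoc", "8", "32Gi", 20),
]

for quota in quotas:
    print(json.dumps(quota, indent=2))
```

Each manifest could be saved to its own file and applied with `kubectl apply -f`, assuming the namespaces already exist; Spark jobs submitted into a namespace would then be bounded by that namespace's quota.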

However, please note that all of this is based only on personal observations and some local tests of the early Spark on Kubernetes feature (2.3.0).

answered Sep 20 '22 15:09

Bartosz Konieczny


Tags:

kubernetes

apache-spark

hadoop
