I have started to learn about Apache Spark and am very impressed by the framework. However, one thing keeps bothering me: all the Spark presentations talk about how Spark caches RDDs, so multiple operations that need the same data are faster than with other approaches such as MapReduce.
So my question is: if that is the case, why not just add a caching engine inside MR frameworks like YARN/Hadoop?
Why create a new framework altogether?
I am sure I am missing something here, and I hope you can point me to some documentation that teaches me more about Spark.
Caching and in-memory computation are definitely a big thing for Spark; however, there are other things.
RDD (Resilient Distributed Dataset): the RDD is the main abstraction of Spark. It allows recovery from failed nodes by recomputing lost data from the DAG of operations that produced it, while also supporting a recovery style closer to Hadoop's by way of checkpointing, which reduces the dependencies of an RDD. Storing a Spark job as a DAG allows for lazy computation of RDDs and also lets Spark's optimization engine schedule the flow in ways that make a big difference in performance.
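To make the lazy evaluation and caching point concrete, here is a minimal sketch, assuming a local Spark installation with spark-core on the classpath. The object name `CachingSketch`, the `run` helper, and the sample log lines are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  // Returns (number of ERROR lines, number of ERROR lines mentioning a timeout).
  def run(sc: SparkContext): (Long, Long) = {
    // Hypothetical sample data standing in for a real log file.
    val lines = sc.parallelize(Seq(
      "INFO starting job",
      "ERROR disk full",
      "WARN slow response",
      "ERROR timeout contacting node"))

    // filter is a lazy transformation: nothing runs here, Spark only records
    // the lineage. cache() marks the result to be kept in memory once computed.
    val errors = lines.filter(_.startsWith("ERROR")).cache()

    // The first action runs the filter and caches the resulting partitions.
    val total = errors.count()
    // The second action starts from the cached partitions instead of the source.
    val timeouts = errors.filter(_.contains("timeout")).count()

    // The recorded lineage (the DAG used for recomputation) can be inspected.
    println(errors.toDebugString)
    (total, timeouts)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("caching-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

If a cached partition is lost, Spark can rebuild it by replaying the recorded transformations on the source data, which is the recovery-by-recomputation idea described above.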
Spark API: Hadoop MapReduce has a very strict API that doesn't allow for as much versatility. Since Spark abstracts away many of the low-level details, it allows for more productivity. Also, things like broadcast variables and accumulators are much more versatile than DistributedCache and counters, IMO.
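As a rough illustration of broadcast variables and accumulators, here is a small sketch, assuming Spark 2.x or later (for `longAccumulator`) running in local mode. The lookup table, the country codes, and the names `SharedVariablesSketch` and `run` are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  // Returns (resolved country names, number of codes missing from the lookup).
  def run(sc: SparkContext): (Seq[String], Long) = {
    // Broadcast: a read-only lookup table shipped once to each executor.
    val countryNames = sc.broadcast(Map("DE" -> "Germany", "FR" -> "France"))
    // Accumulator: tasks can only add to it, the driver reads the total.
    val unknown = sc.longAccumulator("unknownCodes")

    val codes = sc.parallelize(Seq("DE", "FR", "XX", "DE"))

    // Update the accumulator inside an action (foreach), where Spark applies
    // each task's update exactly once.
    codes.foreach { code =>
      if (!countryNames.value.contains(code)) unknown.add(1)
    }

    val resolved = codes
      .map(code => countryNames.value.getOrElse(code, "unknown"))
      .collect()
      .toSeq

    (resolved, unknown.sum)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shared-vars-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

The broadcast value plays roughly the role DistributedCache plays in MapReduce, and the accumulator roughly the role of a counter, but both are ordinary typed Scala values used inline in the job.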
Spark Streaming: Spark Streaming is based on the Discretized Streams paper, which proposes a new model for doing windowed computations on streams using micro-batches. Hadoop doesn't support anything like this.
As a product of in-memory computation, Spark sort of acts as its own flow scheduler, whereas with standard MR you need an external job scheduler like Azkaban or Oozie to schedule complex flows.
The Hadoop project is made up of MapReduce, YARN, Common and HDFS; Spark, however, is attempting to create one unified big data platform with libraries (in the same repo) for machine learning, graph processing, streaming, and multiple SQL-type libraries, and I believe a deep learning library is in the beginning stages. While none of this is strictly a feature of Spark, it is a product of Spark's computing model. Tachyon and BlinkDB are two other technologies that are built around Spark.
So it's much more than just caching.
Aaronman covered a lot, so I'll only add what he missed.
Raw performance without caching is 2-10x faster, due to a generally more efficient and well-architected framework. E.g. one JVM per node with Akka threads is better than forking a whole process for each task.
Scala API. Scala stands for Scalable Language and is, in my opinion, clearly the best language to choose for parallel processing. They say Scala cuts down code by 2-5x, but in my experience from refactoring code in other languages, especially Java MapReduce code, it's more like 10-100x less code. Seriously, I have refactored hundreds of lines of Java into a handful of lines of Scala / Spark. It's also much easier to read and reason about. Spark is even more concise and easy to use than Hadoop abstraction tools like Pig and Hive; it's even better than Scalding.
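For a sense of the conciseness, here is the usual word count written against the RDD API. This is only a sketch assuming Spark in local mode; `WordCountSketch`, `wordCount`, and the input lines are illustrative names and data, not taken from any real job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  // Counts whitespace-separated words across the given lines.
  def wordCount(sc: SparkContext, lines: Seq[String]): Map[String, Int] =
    sc.parallelize(lines)
      .flatMap(_.split("\\s+"))   // split each line into words
      .filter(_.nonEmpty)         // drop empty tokens from leading whitespace
      .map(word => (word, 1))     // the "map" phase
      .reduceByKey(_ + _)         // the "reduce" phase, including the shuffle
      .collect()
      .toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(wordCount(sc, Seq("to be or not to be", "to do")))
    finally sc.stop()
  }
}
```

The whole job is the single chained expression in `wordCount`; the equivalent Java MapReduce program typically needs a mapper class, a reducer class, and a driver that wires them together.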
Spark has a REPL / shell. The need for a compilation-deployment cycle in order to run simple jobs is eliminated. One can interactively play with data just like one uses Bash to poke around a system.
The last thing that comes to mind is ease of integration with Bigtable-style DBs like Cassandra and HBase. In Cassandra, to read a table in order to do some analysis, one just does:
sc.cassandraTable[MyType](keyspace, tableName).select(myCols).where(someCQL)
Similar things are expected for HBase. Now try doing that in any other MPP framework!
UPDATE: I should point out that these are just the advantages of Spark itself; there are quite a few useful things on top, e.g. GraphX for graph processing, MLlib for easy machine learning, Spark SQL for BI, BlinkDB for insanely fast approximate queries, and, as mentioned, Spark Streaming.