I have read that Apache Spark stores data in-memory. However, Apache Spark is meant for analyzing huge volumes of data (a.k.a. big data analytics). In this context, what does in-memory data storage really mean? Is the data it can store limited by the available RAM? How does its data storage compare with Apache Hadoop, which uses HDFS?
What is Spark in-memory computing? In in-memory computation, the data is kept in random access memory (RAM) instead of on slow disk drives and is processed in parallel, which lets us detect patterns and analyze large datasets quickly. This approach has become popular as the cost of memory has fallen.
In-memory cluster computing lets Spark run iterative algorithms efficiently, since a program can cache data and refer back to it without reloading it from disk; it also supports interactive querying and streaming data analysis with very low latency.
Spark holds intermediate results in memory rather than writing them to disk, which is especially useful when you need to work on the same dataset multiple times. It is designed to be an execution engine that works both in-memory and on-disk.
The cache() method on a Spark DataFrame or Dataset defaults to the storage level `MEMORY_AND_DISK`, because recomputing the in-memory columnar representation of the underlying table is expensive. Note that this differs from the default level of `RDD.cache()`, which is `MEMORY_ONLY`.
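To make the difference concrete, here is a minimal sketch of the two defaults and of caching a dataset once so that several actions can reuse it (the file names, the `status` column, and the local master URL are placeholder assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cache-demo").master("local[*]").getOrCreate()

// DataFrame/Dataset: cache() defaults to the MEMORY_AND_DISK storage level
val events = spark.read.json("events.json")            // hypothetical input file
events.cache()
events.count()                                          // first action materialises the cache
events.filter("status = 'error'").count()               // reuses the cached columnar data

// RDD: cache() defaults to MEMORY_ONLY
val lines = spark.sparkContext.textFile("events.txt")   // hypothetical input file
lines.cache()
lines.count()
```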
In Hadoop, data is persisted to disk between steps, so a typical multi-step job ends up looking something like this:
hdfs -> read & map -> persist -> read & reduce -> hdfs -> read & map -> persist -> read & reduce -> hdfs
This is a brilliant design, and it makes perfect sense when you're batch-processing files that fit the map-reduce pattern well. But for some workloads this can be extremely slow; iterative algorithms are hit especially hard. You've spent time building some data structure (a graph, for instance), and all you want to do in each step is update a score. Persisting the entire graph to disk and reading it back on every step will slow your job down.
Spark uses a more general engine that supports cyclic data flows and tries to keep data in memory between job steps. This means that if you can design a data structure and partitioning strategy where your data doesn't shuffle around between steps, you can update it efficiently without serialising and writing everything to disk in between. That's why Spark's front page has a chart showing a 100x speedup on logistic regression.
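As an illustration, here is a minimal sketch of such an iterative job, a toy PageRank-style score update (the edge-list file, partition count, and iteration count are assumptions). The large link structure is built, partitioned, and cached once; each iteration only rebuilds the much smaller ranks dataset:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("iterative-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val partitioner = new HashPartitioner(8)                 // assumed partition count

// Build the graph once: "src dst" pairs grouped into adjacency lists,
// hash-partitioned and kept in memory across all iterations.
val links = sc.textFile("edges.txt")                     // hypothetical input file
  .map { line => val Array(src, dst) = line.split("\\s+"); (src, dst) }
  .groupByKey(partitioner)
  .cache()

var ranks = links.mapValues(_ => 1.0)

for (_ <- 1 to 10) {
  val contribs = links.join(ranks).flatMap {
    case (_, (dests, rank)) => dests.map(dest => (dest, rank / dests.size))
  }
  // Reuse the same partitioner so `ranks` stays co-partitioned with `links`;
  // the join above then never has to shuffle or rewrite the cached graph.
  ranks = contribs.reduceByKey(partitioner, _ + _)
}

ranks.take(5).foreach(println)
```

Only the small per-key scores move between steps; the expensive-to-build structure stays resident in executor memory for the whole job.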
If you write a Spark job that just computes a value from each input line in your dataset and writes it back to disk, Hadoop and Spark will be pretty much equal in performance (Spark's start-up time is faster, but that hardly matters when a single step spends hours processing data).
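For example, a job like the following touches every line exactly once and writes the result straight back out, so there is nothing worth keeping in memory between steps (paths are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("single-pass").master("local[*]").getOrCreate()

// Single-pass job: every input line is read once, transformed once, written once.
// Nothing is reused, so Spark's in-memory caching brings little advantage here.
spark.sparkContext.textFile("hdfs:///input/events.txt")       // hypothetical path
  .map(_.toUpperCase)
  .saveAsTextFile("hdfs:///output/events-upper")               // hypothetical path
```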
If Spark cannot hold an RDD in memory between steps, it will spill it to disk, much like Hadoop does. But remember that Spark isn't a silver bullet: there are corner cases where its in-memory nature causes OutOfMemory problems you have to fight, whereas Hadoop would simply write everything to disk.
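When you suspect a dataset won't fit in RAM, you can ask for disk spill-over explicitly instead of relying on the defaults; this is a sketch, with the input path and storage level chosen as assumptions for a large dataset:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("spill-demo").master("local[*]").getOrCreate()

// MEMORY_AND_DISK_SER stores partitions in serialised form and spills whatever
// does not fit in RAM to local disk, reducing memory pressure compared with
// the deserialised in-memory-only default of RDD.cache().
val big = spark.sparkContext.textFile("hdfs:///input/huge-dataset")  // hypothetical path
big.persist(StorageLevel.MEMORY_AND_DISK_SER)
big.count()
```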
I personally like to think of it this way: on a cluster of 500 machines with 64 GB of RAM each, Hadoop is built to batch-process your 500 TB job faster by distributing the disk reads and writes, while Spark exploits the fact that 500 × 64 GB = 32 TB of memory can likely solve quite a few of your other problems entirely in-memory!