What does in-memory data storage mean in the context of Apache Spark?

I have read that Apache Spark stores data in memory. However, Apache Spark is meant for analyzing huge volumes of data (a.k.a. big data analytics). In this context, what does in-memory data storage really mean? Is the amount of data it can store limited by the available RAM? How does its data storage compare with Apache Hadoop, which uses HDFS?

asked Aug 15 '14 by Chidu

People also ask

What does in-memory mean in Spark?

What is Spark in-memory computing? In in-memory computation, data is kept in random-access memory (RAM) instead of slow disk drives and is processed in parallel. This makes it possible to detect patterns and analyze large datasets quickly, and the approach has become popular as the cost of memory has fallen.

What is in-memory in Apache Spark?

In-memory cluster computation enables Spark to run iterative algorithms, as programs can checkpoint data and refer back to it without reloading it from disk; in addition, it supports interactive querying and streaming data analysis at extremely fast speeds.

Why does Apache Spark primarily store its data in memory?

Spark holds intermediate results in memory rather than writing them to disk, which is especially useful when you need to work on the same dataset multiple times. It is designed to be an execution engine that works both in memory and on disk.

Is Spark DataFrame in-memory?

The Spark DataFrame/Dataset `cache()` method defaults to the `MEMORY_AND_DISK` storage level, because recomputing the in-memory columnar representation of the underlying table is expensive. Note that this differs from the default level of `RDD.cache()`, which is `MEMORY_ONLY`.
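As a small illustration of those defaults, here is a hedged sketch (assuming a local SparkSession; the dataset sizes and names are arbitrary) that caches a DataFrame, caches an RDD, and persists an RDD with an explicitly chosen storage level:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheLevelsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-levels-sketch")
      .master("local[*]")            // assumption: a local run, purely for illustration
      .getOrCreate()

    // DataFrame/Dataset cache() defaults to MEMORY_AND_DISK: partitions that
    // don't fit in RAM are spilled to local disk rather than recomputed.
    val df = spark.range(0, 1000000).toDF("id").cache()
    df.count()                       // first action materialises the cache

    // RDD.cache() defaults to MEMORY_ONLY: partitions that don't fit are
    // dropped and recomputed from lineage the next time they are needed.
    val rdd = spark.sparkContext.parallelize(1 to 1000000).cache()
    rdd.count()

    // Either API can choose a level explicitly via persist().
    val explicit = spark.sparkContext.parallelize(1 to 1000000)
      .persist(StorageLevel.MEMORY_AND_DISK)
    explicit.count()

    spark.stop()
  }
}
```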


1 Answer

In Hadoop, data is persisted to disk between steps, so a typical multi-step job ends up looking something like this:

hdfs -> read & map -> persist -> read & reduce -> hdfs -> read & map -> persist -> read & reduce -> hdfs

This is a brilliant design, and it makes perfect sense when you're batch-processing files that fit the map-reduce pattern well. But for some workloads it can be extremely slow; iterative algorithms are hit especially hard. You've spent time building some data structure (a graph, for instance), and all you want to do in each step is update a score. Persisting the entire graph to disk and reading it back on every step will slow your job down.

Spark uses a more general engine that supports cyclic data flows and tries to keep data in memory between job steps. What this means is that if you can design a data structure and partitioning strategy such that your data doesn't shuffle around between steps, you can update it efficiently without serialising and writing everything to disk each time. That's why Spark's front page shows a chart with a 100x speedup on logistic regression.
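To make the iterative case concrete, here is a minimal sketch (not the author's code; the edge list and the halving "score" rule are made up) in which the large edge RDD is partitioned once and cached, so each iteration only shuffles the small score RDD and nothing is written to HDFS between steps:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object IterativeScoresSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

    // Toy graph as (node, neighbour) pairs, partitioned once by node id so the
    // join below never has to shuffle the edge list again.
    val partitioner = new HashPartitioner(8)
    val edges = sc.parallelize(Seq((1, 2), (2, 3), (3, 1), (3, 4)))
      .partitionBy(partitioner)
      .cache()                        // built once, kept in memory across iterations

    // Initial score per node, co-partitioned with the edges.
    var scores = edges.mapValues(_ => 1.0).reduceByKey(partitioner, _ + _)

    for (_ <- 1 to 10) {
      // Each iteration reads the cached edges from memory; only the small
      // score RDD is shuffled, and nothing goes back to HDFS between steps.
      val contributions = edges.join(scores)
        .map { case (_, (dst, score)) => (dst, score / 2.0) }
      scores = contributions.reduceByKey(partitioner, _ + _)
    }

    scores.collect().foreach(println)
    sc.stop()
  }
}
```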

If you write a Spark job that just computes a value from each input line in your dataset and writes that back to disk, Hadoop and Spark will be pretty much equal in performance (Spark starts up faster, but that hardly matters when a single step spends hours processing data).
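For comparison, a single-pass job of that shape is just read, map, write, so there is nothing for Spark to keep in memory between steps. A sketch (the HDFS paths are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SinglePassSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("single-pass-sketch"))

    // One read, one map, one write: the same shape as a single MapReduce job,
    // so in-memory caching buys nothing here.
    sc.textFile("hdfs:///input/lines")           // placeholder input path
      .map(line => line.length.toString)         // compute one value per input line
      .saveAsTextFile("hdfs:///output/lengths")  // placeholder output path

    sc.stop()
  }
}
```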

If Spark cannot hold an RDD in memory between steps, it will spill it to disk, much like Hadoop does. But remember that Spark isn't a silver bullet: there are corner cases where you'll have to fight its in-memory nature and the OutOfMemory errors it can cause, whereas Hadoop would simply write everything to disk.
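If memory pressure does become a problem, one common knob (shown here only as a sketch of one option, not a universal fix) is an explicit storage level that keeps cached partitions serialized and spills whatever still doesn't fit to disk, which behaves more like Hadoop's always-on-disk model under pressure:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SpillFriendlyPersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("spill-sketch").setMaster("local[*]"))

    // MEMORY_AND_DISK_SER stores partitions as serialized bytes (smaller, but
    // more CPU to deserialize) and spills the rest to local disk.
    val big = sc.textFile("hdfs:///big/input")   // placeholder path
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    println(big.count())                         // first action materialises the persisted RDD
    println(big.filter(_.nonEmpty).count())      // second pass reuses the persisted copy

    sc.stop()
  }
}
```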

I personally like to think of it this way: in a cluster of 500 machines with 64 GB of RAM each, Hadoop is built to batch-process your 500 TB job faster by distributing the disk reads and writes. Spark exploits the fact that 500 * 64 GB = 32 TB of memory can likely solve quite a few of your other problems entirely in memory!

answered Oct 09 '22 by jkgeyti