I just started learning Spark. As per my understanding, Spark stores the intermediate output in RAM, so it is very fast compared to Hadoop. Correct me if I am wrong.
My question is: if my intermediate output is 2 GB and my free RAM is 1 GB, what happens in this case? It may be a silly question, but I have not understood the in-memory concept of Spark. Can anyone please explain the in-memory concept of Spark?
Thanks
Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Memory: in general, Spark can run well with anywhere from 8 GiB to hundreds of gigabytes of memory per machine. In all cases, we recommend allocating at most 75% of the memory for Spark; leave the rest for the operating system and buffer cache.
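As a rough illustration of that 75% guideline, here is a minimal sketch assuming a hypothetical worker with 16 GiB of RAM (the app name and sizing are made up for the example; spark.executor.memory is the standard setting):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sizing for a 16 GiB worker: leaving ~25% for the OS and
    // buffer cache gives roughly 12 GiB for the Spark executor.
    val conf = new SparkConf()
      .setAppName("memory-sizing-sketch")
      .set("spark.executor.memory", "12g")

    val sc = new SparkContext(conf)

In practice the executor memory also has to fit within whatever limit your cluster manager (YARN, standalone worker, etc.) imposes on containers.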
The main abstraction of Spark is the RDD, and RDDs are cached using the cache() or persist() method. When we call cache(), Spark tries to keep the whole RDD in memory; any data that does not fit is either recomputed on the fly or, depending on the storage level, spilled to disk.
In Apache Spark, in-memory computation means that instead of storing data on slow disk drives, the data is kept in random access memory (RAM) and processed there in parallel. By using in-memory processing we can detect patterns and analyze large datasets without paying the cost of repeated disk I/O.
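A minimal sketch of what cache() does in practice, assuming a local SparkContext and a hypothetical input path (both are placeholders for the example):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("cache-sketch").setMaster("local[*]"))

    // Hypothetical input path; replace with your own data.
    val lines  = sc.textFile("hdfs:///data/events.log")
    val errors = lines.filter(_.contains("ERROR"))

    errors.cache()                            // mark the RDD to be kept in memory once computed
    val total    = errors.count()             // first action: computes the RDD and caches what fits in RAM
    val distinct = errors.distinct().count()  // second action: reuses the cached partitions instead of re-reading the file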
This question is asking about RDD persistence in Spark.
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
Depending on how you set the storage level for an RDD, different outcomes can be configured. For example, if you set your storage level to MEMORY_ONLY (which is the default), Spark will store as much of the RDD as it can in memory and recompute the rest on the fly. You can persist your RDD and apply a storage level like this: rdd.persist(MEMORY_ONLY).
In your example case, 1 GB of your output will be computed and kept in memory, and the other 1 GB will be recomputed when it is needed by a later step. There are other storage levels that can be set as well, depending on your use case:
MEMORY_AND_DISK -- compute the entire RDD, but spill some content to disk when necessary
MEMORY_ONLY_SER, MEMORY_AND_DISK_SER -- same as above, but all elements are serialized
DISK_ONLY -- store all partitions straight to disk
MEMORY_ONLY_2, MEMORY_AND_DISK_2 -- same as above, but partitions are replicated twice for more fault tolerance
Again, you have to look at your use case to figure out which storage level is best. In some cases, recomputing an RDD may actually be quicker than loading everything back from disk. In other cases, a fast serializer can reduce the amount of data read from disk, leading to a quicker response with the data in question.
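A short sketch of how the two most relevant levels would be applied to the 2 GB intermediate output from the question (the path is hypothetical, and sc is assumed to be an existing SparkContext):

    import org.apache.spark.storage.StorageLevel

    // Hypothetical RDD standing in for the 2 GB intermediate output.
    val intermediate = sc.textFile("hdfs:///data/big-intermediate")

    // Default behaviour (MEMORY_ONLY): keep what fits in RAM, recompute the rest on demand.
    intermediate.persist(StorageLevel.MEMORY_ONLY)

    // Alternative: keep what fits in RAM and spill the remaining partitions to local disk
    // instead of recomputing them. An RDD's storage level cannot be changed once assigned,
    // so unpersist() first.
    intermediate.unpersist()
    intermediate.persist(StorageLevel.MEMORY_AND_DISK)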
If I understand your question correctly, I can reply with the following:
The intermediate or temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.
The spark.local.dir directory is used for "scratch" space in Spark, including map output files and RDDs that get stored on disk. [Ref. Spark Configuration.]
It should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
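For example, a minimal sketch of setting that scratch space programmatically (the SSD paths and app name are made up for the illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical paths: two fast local SSDs used as scratch space for shuffle
    // files and partitions that spill to disk. In cluster deployments this setting
    // may be overridden by the cluster manager (e.g. YARN's local directories).
    val conf = new SparkConf()
      .setAppName("scratch-dir-sketch")
      .set("spark.local.dir", "/mnt/ssd1/spark-tmp,/mnt/ssd2/spark-tmp")

    val sc = new SparkContext(conf)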
Nevertheless, the issue you are describing here is also called RDD persistence. Beyond the basics of Spark caching that you should already know, there is also what's referred to as the storage level of an RDD, which lets you choose between different storage strategies.
This allows you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon (this last one is experimental). More information here.
Note: these levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is shorthand for using the default storage level, StorageLevel.MEMORY_ONLY, where Spark stores deserialized objects in memory.
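To make that concrete, here is a minimal sketch (assuming an existing SparkContext sc; the RDDs are toy placeholders):

    import org.apache.spark.storage.StorageLevel

    val numbers = sc.parallelize(1 to 1000000)

    // cache() is just shorthand for persist() with the default level:
    numbers.cache()                            // equivalent to numbers.persist(StorageLevel.MEMORY_ONLY)

    // To save space at the cost of extra CPU, a serialized in-memory copy can be
    // requested instead (shown on a separate RDD, since a storage level cannot be
    // changed without unpersisting first):
    val compact = sc.parallelize(1 to 1000000)
    compact.persist(StorageLevel.MEMORY_ONLY_SER)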
So to answer your question now,
if my intermediate output is 2 GB and my free RAM is 1 GB, then what happens in this case?
I would say it depends on how you configure and tune your Spark application and cluster.
Note: the in-memory concept in Spark is similar to any other in-memory system, concept-wise; the main aim is to avoid heavy and expensive I/O. Which also means, coming back to your question, that if you decide to persist to DISK, say, you will lose some performance. More on that in the official documentation referenced in this answer.