I just started learning Spark. As per my understanding, Spark stores the intermediate output in RAM, so it is very fast compared to Hadoop. Correct me if I am wrong.
My question is: if my intermediate output is 2 GB and my free RAM is 1 GB, what happens in this case? It may be a silly question, but I have not understood the in-memory concept of Spark. Can anyone please explain the in-memory concept of Spark?
Thanks
Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
Memory: in general, Spark can run well with anywhere from 8 GiB to hundreds of gigabytes of memory per machine. In all cases, we recommend allocating at most 75% of the memory for Spark; leave the rest for the operating system and buffer cache.
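As a rough illustration of that 75% guideline, here is a minimal sketch assuming a hypothetical worker with 16 GiB of RAM (the app name and sizing are made up for the example; spark.executor.memory is the standard setting):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sizing for a 16 GiB worker: leaving ~25% for the OS and
    // buffer cache gives roughly 12 GiB for the Spark executor.
    val conf = new SparkConf()
      .setAppName("memory-sizing-sketch")
      .set("spark.executor.memory", "12g")

    val sc = new SparkContext(conf)

In practice the executor memory also has to fit within whatever limit your cluster manager (YARN, standalone worker, etc.) imposes on containers.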
The main abstraction of Spark is the RDD, and RDDs are cached using the cache() or persist() method. When we call cache(), Spark tries to keep the whole RDD in memory; any data that does not fit is either recomputed on the fly or, depending on the storage level, spilled to disk.
In Apache Spark, in-memory computation means that instead of storing data on slow disk drives, the data is kept in random access memory (RAM) and processed there in parallel. By using in-memory processing we can detect patterns and analyze large datasets without paying the cost of repeated disk I/O.
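A minimal sketch of what cache() does in practice, assuming a local SparkContext and a hypothetical input path (both are placeholders for the example):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("cache-sketch").setMaster("local[*]"))

    // Hypothetical input path; replace with your own data.
    val lines  = sc.textFile("hdfs:///data/events.log")
    val errors = lines.filter(_.contains("ERROR"))

    errors.cache()                            // mark the RDD to be kept in memory once computed
    val total    = errors.count()             // first action: computes the RDD and caches what fits in RAM
    val distinct = errors.distinct().count()  // second action: reuses the cached partitions instead of re-reading the file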
This question is asking about RDD persistence in Spark.
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
Depending on how you set the storage level for an RDD, different outcomes can be configured. For example, if you set your storage level to MEMORY_ONLY (which is the default), Spark will store as much of the RDD as it can in memory and recompute the rest on the fly. You can persist your RDD and apply a storage level like this: rdd.persist(MEMORY_ONLY).
In your example case, 1 GB of your output will be computed and kept in memory, and the other 1 GB will be recomputed when it is needed by a later step. There are other storage levels that can be set as well, depending on your use case:
MEMORY_AND_DISK -- compute the entire RDD, but spill some content to disk when necessary
MEMORY_ONLY_SER, MEMORY_AND_DISK_SER -- same as above, but all elements are serialized
DISK_ONLY -- store all partitions straight to disk
MEMORY_ONLY_2, MEMORY_AND_DISK_2 -- same as above, but partitions are replicated twice for more fault tolerance
Again, you have to look at your use case to figure out which storage level is best. In some cases, recomputing an RDD may actually be quicker than loading everything back from disk. In other cases, a fast serializer can reduce the amount of data read from disk, leading to a quicker response with the data in question.
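A short sketch of how the two most relevant levels would be applied to the 2 GB intermediate output from the question (the path is hypothetical, and sc is assumed to be an existing SparkContext):

    import org.apache.spark.storage.StorageLevel

    // Hypothetical RDD standing in for the 2 GB intermediate output.
    val intermediate = sc.textFile("hdfs:///data/big-intermediate")

    // Default behaviour (MEMORY_ONLY): keep what fits in RAM, recompute the rest on demand.
    intermediate.persist(StorageLevel.MEMORY_ONLY)

    // Alternative: keep what fits in RAM and spill the remaining partitions to local disk
    // instead of recomputing them. An RDD's storage level cannot be changed once assigned,
    // so unpersist() first.
    intermediate.unpersist()
    intermediate.persist(StorageLevel.MEMORY_AND_DISK)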
If I understand your question correctly, I can reply with the following:
The intermediate or temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.
The spark.local.dir directory is used for "scratch" space in Spark, including map output files and RDDs that get stored on disk. [Ref. Spark Configuration.]
It should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
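For example, a minimal sketch of setting that scratch space programmatically (the SSD paths and app name are made up for the illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical paths: two fast local SSDs used as scratch space for shuffle
    // files and partitions that spill to disk. In cluster deployments this setting
    // may be overridden by the cluster manager (e.g. YARN's local directories).
    val conf = new SparkConf()
      .setAppName("scratch-dir-sketch")
      .set("spark.local.dir", "/mnt/ssd1/spark-tmp,/mnt/ssd2/spark-tmp")

    val sc = new SparkContext(conf)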
Nevertheless, the issue you are describing here is also called RDD persistence. Beyond the basics of Spark caching that you should already know, there is also what's referred to as the storage level of an RDD, which lets you choose between different storage strategies.
This allows you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon (this last one is experimental). More information here.
Note: these levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is shorthand for using the default storage level, StorageLevel.MEMORY_ONLY, where Spark stores deserialized objects in memory.
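To make that concrete, here is a minimal sketch (assuming an existing SparkContext sc; the RDDs are toy placeholders):

    import org.apache.spark.storage.StorageLevel

    val numbers = sc.parallelize(1 to 1000000)

    // cache() is just shorthand for persist() with the default level:
    numbers.cache()                            // equivalent to numbers.persist(StorageLevel.MEMORY_ONLY)

    // To save space at the cost of extra CPU, a serialized in-memory copy can be
    // requested instead (shown on a separate RDD, since a storage level cannot be
    // changed without unpersisting first):
    val compact = sc.parallelize(1 to 1000000)
    compact.persist(StorageLevel.MEMORY_ONLY_SER)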
So to answer your question now,
if my intermediate output is 2 GB and my free RAM is 1 GB, then what happens in this case?
I would say it depends on how you configure and tune your Spark application and cluster.
Note: the in-memory concept in Spark is similar to any other in-memory system, concept-wise; the main aim is to avoid heavy and expensive I/O. Which also means, coming back to your question, that if you decide to persist to DISK, say, you will lose some performance. More on that in the official documentation referenced in this answer.