 

What happens when the intermediate output does not fit in RAM in Spark

I just started learning Spark. As per my understanding, Spark stores the intermediate output in RAM, so it is very fast compared to Hadoop. Correct me if I am wrong.

My question is, if my intermediate output is 2 GB and my free RAM is 1 GB, then what happens in this case? It may be a silly question, but I have not understood Spark's in-memory concept. Can anyone please explain it?

Thanks

asked Oct 18 '15 by Kishore

People also ask

How does Apache Spark process data that does not fit into memory?

Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.

How does Spark use RAM?

Memory. In general, Spark can run well with anywhere from 8 GiB to hundreds of gigabytes of memory per machine. In all cases, we recommend allocating only at most 75% of the memory for Spark; leave the rest for the operating system and buffer cache.
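As a rough sketch of what "leave memory for the OS" can look like in practice (the 6g value and app name below are purely illustrative for a machine with 8 GiB of RAM, not a recommendation):

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap the executor heap at roughly 75% of the machine's RAM,
    // leaving the rest for the operating system and buffer cache.
    val conf = new SparkConf()
      .setAppName("memory-sizing-sketch")
      .setMaster("local[*]")               // on a real cluster: the master URL
      .set("spark.executor.memory", "6g")  // illustrative value for an 8 GiB node

    val sc = new SparkContext(conf)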

What is the main abstraction that Spark uses for managing data in memory?

Spark's main abstraction is the RDD. RDDs are cached using the cache() or persist() method. With cache(), the RDD is stored in memory; any data that does not fit is either recomputed when needed or, depending on the storage level, spilled to disk.
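A minimal sketch of that behaviour (the file path is made up, and `sc` is the SparkContext you get when starting spark-shell):

    // cache() keeps the RDD's partitions in memory after the first action.
    // Partitions that do not fit are simply not stored and are recomputed
    // from the lineage the next time they are needed.
    val lines = sc.textFile("hdfs:///data/events.log")
    lines.cache()

    val total  = lines.count()                              // computes and caches
    val errors = lines.filter(_.contains("ERROR")).count()  // reuses cached partitions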

What is in-memory computation in Spark?

In Apache Spark, in-memory computation means that instead of storing data on slow disk drives, the data is kept in random-access memory (RAM) and processed there in parallel. In-memory processing makes it much faster to detect patterns and analyze large datasets.


2 Answers

This question is asking about RDD persistence in Spark.

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

The storage level you set for an RDD determines what happens when it does not fit in memory. For example, with MEMORY_ONLY (the default storage level), Spark stores as many partitions as it can in memory and recomputes the rest of the RDD on the fly. You persist an RDD with a given storage level like this: rdd.persist(StorageLevel.MEMORY_ONLY).
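A short sketch of that call, assuming `sc` is an existing SparkContext (e.g. from spark-shell) and the input path is purely illustrative:

    import org.apache.spark.storage.StorageLevel

    // Same behaviour as cache(): keep what fits in memory,
    // recompute the rest on demand from the lineage.
    val intermediate = sc.textFile("hdfs:///tmp/input").map(_.toUpperCase)
    intermediate.persist(StorageLevel.MEMORY_ONLY)
    intermediate.count()   // first action materialises and caches what fits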

In your example, roughly 1 GB of the output will be computed and kept in memory, and the other 1 GB will be recomputed when it is needed by a later step. There are other storage levels you can set depending on your use case (a short sketch follows the list):

  1. MEMORY_AND_DISK -- compute the entire RDD, but spill partitions that do not fit in memory to disk
  2. MEMORY_ONLY_SER, MEMORY_AND_DISK_SER -- same as above, but all elements are stored serialized
  3. DISK_ONLY -- store all partitions straight to disk
  4. MEMORY_ONLY_2, MEMORY_AND_DISK_2 -- same as the non-suffixed levels, but each partition is replicated on two cluster nodes for fault tolerance
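For the 2 GB output / 1 GB free RAM scenario, a sketch with MEMORY_AND_DISK would look like this (names and paths are illustrative, `sc` is an existing SparkContext):

    import org.apache.spark.storage.StorageLevel

    // With MEMORY_AND_DISK, partitions that do not fit in memory are
    // spilled to local disk instead of being recomputed later.
    val intermediate = sc.textFile("hdfs:///tmp/big-input").map(_.reverse)
    intermediate.persist(StorageLevel.MEMORY_AND_DISK)
    intermediate.count()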

Again, look at your use case to figure out which storage level is best. In some cases, recomputing an RDD may actually be quicker than reading everything back from disk. In other cases, a fast serializer can reduce the amount of data read from disk, giving you a quick response for the data in question.

answered Sep 28 '22 by Rohan Aletty


If I understand your question correctly, I can reply with the following:

The intermediate or temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.

spark.local.dir is the directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. [Ref. Spark Configuration.]

This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
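A sketch of setting it when building the context (the directory paths are placeholders; note that on a cluster manager such as YARN or Mesos this setting may be overridden by the manager's own local-directory configuration):

    import org.apache.spark.{SparkConf, SparkContext}

    // Point Spark's scratch space at fast local disks,
    // as a comma-separated list of directories.
    val conf = new SparkConf()
      .setAppName("scratch-dir-sketch")
      .setMaster("local[*]")
      .set("spark.local.dir", "/mnt/ssd1/spark-tmp,/mnt/ssd2/spark-tmp")

    val sc = new SparkContext(conf)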

Nevertheless, the issue you are addressing here is also called RDD persistence. Beyond the basics of Spark caching, each RDD has what is referred to as a storage level, which lets you choose how it is stored.

This allows you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon (this last option is experimental). More information here.

Note: These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist. The cache method is shorthand for using the default storage level, StorageLevel.MEMORY_ONLY, in which Spark stores deserialized objects in memory.
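A tiny illustration of that equivalence (the RDD contents are made up; `sc` is an existing SparkContext):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000)

    // These two calls are equivalent: cache() is just shorthand for
    // persist(StorageLevel.MEMORY_ONLY).
    rdd.cache()
    rdd.persist(StorageLevel.MEMORY_ONLY)   // no-op here: the same level is already set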

So to answer your question now,

if my intermediate output is 2 GB and my free RAM is 1 GB, then what happens in this case?

I would say it depends on how you configure and tune your Spark application and cluster.

Note: Conceptually, in-memory processing in Spark is like any other in-memory system: the main aim is to avoid heavy, expensive I/O. Coming back to your question, this also means that if you decide to persist to disk, say with DISK_ONLY, you will lose performance. More on that in the official documentation referenced in the answer.

answered Sep 28 '22 by eliasah