I'm new to Spark, and the documentation says Spark will load data into memory to make iterative algorithms faster. But what if I have a 10GB log file and only 2GB of memory? Will Spark still load the whole log file into memory?

What will Spark do if I don't have enough memory?

2 Answers

I think this question has been well answered in the FAQ panel of Spark website (https://spark.apache.org/faq.html):

What happens if my dataset does not fit in memory? Often each partition of data is small and does fit in memory, and these partitions are processed a few at a time. For very large partitions that do not fit in memory, Spark's built-in operators perform external operations on datasets.
What happens when a cached dataset does not fit in memory? Spark can either spill it to disk or recompute the partitions that don't fit in RAM each time they are requested. By default, it uses recomputation, but you can set a dataset's storage level to MEMORY_AND_DISK to avoid this.

answered Sep 27 '22 21:09

Kehe CAI

The key here is that RDDs are split into partitions (see how at the end of this answer), and each partition is a set of elements (text lines or integers, for instance). Partitions are used to parallelize computations across different computational units.

So the question is not whether the file is too big but whether a single partition is. For that case, the FAQ says: "Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data". The issue with large partitions generating OOM is solved here.
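To make the "partition, not file" point concrete, here is a minimal plain-Python sketch (not Spark code) of processing a file one partition at a time. The names `iter_partitions`, `count_matching`, and `write_sample_log` are hypothetical helpers made up for this illustration, and a "partition" is simplified to a fixed number of lines. Peak memory depends on the partition size, not on the size of the file.

```python
import os
import tempfile
from itertools import islice


def iter_partitions(path, lines_per_partition):
    """Yield lists of lines, holding only one partition in memory at a time."""
    with open(path, "r", encoding="utf-8") as handle:
        while True:
            partition = list(islice(handle, lines_per_partition))
            if not partition:
                return
            yield partition


def count_matching(path, needle, lines_per_partition=1000):
    """Count lines containing `needle`, processing one partition at a time."""
    total = 0
    for partition in iter_partitions(path, lines_per_partition):
        # Only the per-partition count is kept; the lines themselves are dropped.
        total += sum(1 for line in partition if needle in line)
    return total


def write_sample_log(num_lines):
    """Write a small, made-up log file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".log")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for i in range(num_lines):
            level = "ERROR" if i % 10 == 0 else "INFO"
            handle.write(f"{level} event {i}\n")
    return path


if __name__ == "__main__":
    sample = write_sample_log(5000)
    try:
        print(count_matching(sample, "ERROR", lines_per_partition=500))
    finally:
        os.remove(sample)
```

The same idea is why a 10GB log can be scanned with 2GB of memory: as long as each partition fits, the whole file never has to.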

Now, even if a partition can fit in memory, that memory may already be full. In this case, Spark evicts another partition from memory to make room for the new one. Evicting can mean either:

Deleting the partition completely: in this case, if the partition is required again, it is recomputed.
Persisting the partition at the storage level specified. Each RDD can be "marked" to be cached/persisted using these storage levels, see this on how to.

Memory management is well explained here: "Spark stores partitions in LRU cache in memory. When cache hits its limit in size, it evicts the entry (i.e. partition) from it. When the partition has “disk” attribute (i.e. your persistence level allows storing partition on disk), it would be written to HDD and the memory consumed by it would be freed, unless you would request it. When you request it, it would be read into the memory, and if there won’t be enough memory some other, older entries from the cache would be evicted. If your partition does not have “disk” attribute, eviction would simply mean destroying the cache entry without writing it to HDD".
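The eviction behaviour quoted above can be illustrated with a toy model. This is a plain-Python sketch under simplifying assumptions, not Spark's actual block manager: `ToyBlockCache` and `count_computations` are hypothetical names, capacity is counted in partitions rather than bytes, and "disk" is just a dictionary.

```python
from collections import OrderedDict


class ToyBlockCache:
    """Toy LRU cache of partitions; an illustration, not Spark's implementation."""

    def __init__(self, capacity, use_disk):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity     # max number of partitions held in memory
        self.use_disk = use_disk     # models a storage level with a "disk" attribute
        self.memory = OrderedDict()  # partition id -> data, least recently used first
        self.disk = {}               # stands in for files on local disk
        self.computations = 0        # how many times a partition was (re)computed

    def _make_room(self):
        while len(self.memory) >= self.capacity:
            old_id, old_data = self.memory.popitem(last=False)
            if self.use_disk:
                self.disk[old_id] = old_data  # spill instead of dropping

    def get(self, part_id, compute):
        if part_id in self.memory:
            self.memory.move_to_end(part_id)  # mark as most recently used
            return self.memory[part_id]
        if part_id in self.disk:
            data = self.disk.pop(part_id)     # read the spilled copy back
        else:
            data = compute(part_id)           # no copy anywhere: recompute
            self.computations += 1
        self._make_room()
        self.memory[part_id] = data
        return data


def count_computations(use_disk):
    """Access three partitions twice with room for only two in memory."""
    cache = ToyBlockCache(capacity=2, use_disk=use_disk)
    for part_id in [0, 1, 2, 0, 1, 2]:
        cache.get(part_id, compute=lambda i: [i] * 3)
    return cache.computations


if __name__ == "__main__":
    print(count_computations(use_disk=False))
    print(count_computations(use_disk=True))
```

In this toy run, the memory-only cache computes partitions 6 times (every access misses), while the cache with the "disk" attribute computes each partition once, 3 in total, and reads the rest back from its spilled copies. In PySpark, the disk-backed behaviour corresponds to persisting with `StorageLevel.MEMORY_AND_DISK`.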

How the initial file/data is partitioned depends on the format and type of data, as well as the function used to create the RDD, see this. For instance:

If you already have a collection (a list in Java, for example), you can use parallelize() and specify the number of partitions. Elements in the collection will be grouped into partitions.
If using an external file in HDFS: "Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS)".
If reading from a local text file, each line (terminated by a newline "\n"; the terminator can be changed, see this) is an element, and several lines form a partition.
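As a rough back-of-the-envelope sketch of the first two cases, the following plain-Python helpers (hypothetical names, not Spark APIs) estimate how many partitions a file yields at one partition per 128MB block, and show one simple way a collection could be cut into a requested number of contiguous slices. Sizes are assumed to be binary units (1GB = 1024^3 bytes), and the slicing rule is only an illustration, not a statement of how any particular Spark version groups elements.

```python
HDFS_BLOCK_BYTES = 128 * 1024 * 1024  # the 128MB default block size quoted above


def partitions_for_file(file_bytes, block_bytes=HDFS_BLOCK_BYTES):
    """One partition per block, rounding up; an empty file still gets one."""
    return max(1, (file_bytes + block_bytes - 1) // block_bytes)


def slice_collection(items, num_partitions):
    """Split a list into `num_partitions` contiguous, near-equal slices."""
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


if __name__ == "__main__":
    ten_gb = 10 * 1024 ** 3
    print(partitions_for_file(ten_gb))
    print(slice_collection(list(range(10)), 3))
```

Under these assumptions, the 10GB log file from the question would be read as 80 partitions of at most 128MB each, which is why it can be processed with far less than 10GB of memory.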

Finally, I suggest reading this for more information, and also to help decide how many partitions to use (too many or too few?).

answered Sep 27 '22 21:09

Marc Cayuela

Tags:

apache-spark

WoooHaaaa
