I'm new to Spark, and I found the Documentation says Spark will will load data into memory to make the iteration algorithms faster.
But what if I have a log file of 10GB and only have 2GB memory ? Will Spark load the log file into memory as always ?
Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
While Spark can perform a lot of its computation in memory, it still uses local disks to store data that doesn't fit in RAM, as well as to preserve intermediate output between stages.
I think this question has been well answered in the FAQ panel of Spark website (https://spark.apache.org/faq.html):
The key here is noting that RDDs are split in partitions (see how at the end of this answer), and each partition is a set of elements (can be text lines or integers for instance). Partitions are used to parallelize computations in different computational units.
So the key is not whether a file is too big but whether a partition is. In this case, in the FAQ: "Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data". The issue with large partitions generating OOM is solved here.
Now, even if the partition can fit in memory, such memory can be full. In this case, it evicts another partition from memory to fit the new partition. Evicting can mean either:
Memory management is well explained here: "Spark stores partitions in LRU cache in memory. When cache hits its limit in size, it evicts the entry (i.e. partition) from it. When the partition has “disk” attribute (i.e. your persistence level allows storing partition on disk), it would be written to HDD and the memory consumed by it would be freed, unless you would request it. When you request it, it would be read into the memory, and if there won’t be enough memory some other, older entries from the cache would be evicted. If your partition does not have “disk” attribute, eviction would simply mean destroying the cache entry without writing it to HDD".
How the initial file/data is partitioned depends on the format and type of data, as well as the function used to create the RDD, see this. For instance:
Finally, I suggest you reading this for more information and also to decide how to choose the number of partitions (too many or too few?).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With