I'm building a Spark application where I have to cache about 15 GB of CSV files. I read about the new UnifiedMemoryManager
introduced in Spark 1.6 here:
https://0x0fff.com/spark-memory-management/
The article also includes a diagram of the memory layout.
The author distinguishes between User Memory and Spark Memory (which is again split into Storage and Execution Memory). As I understand it, Spark Memory is flexible for both execution (shuffle, sort, etc.) and storage (caching): if one side needs more memory, it can borrow from the other (if that side is not already completely used). Is this assumption correct?
The User Memory is described like this:
User Memory. This is the memory pool that remains after the allocation of Spark Memory, and it is completely up to you to use it in a way you like. You can store your own data structures there that would be used in RDD transformations. For example, you can rewrite Spark aggregation by using mapPartitions transformation maintaining hash table for this aggregation to run, which would consume so called User Memory. [...] And again, this is the User Memory and its completely up to you what would be stored in this RAM and how, Spark makes completely no accounting on what you do there and whether you respect this boundary or not. Not respecting this boundary in your code might cause OOM error.
How can I access this part of the memory, and how is it managed by Spark?
And for my purpose, do I just need enough Storage Memory (since I don't do things like shuffle, join, etc.)? If so, can I set the spark.memory.storageFraction property to 1.0?
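For reference, this is what setting those properties would look like (a sketch only; spark.memory.fraction and spark.memory.storageFraction are real Spark 1.6 settings, but the values and app name here are just examples for this caching scenario):

```java
import org.apache.spark.SparkConf;

// Sketch: tuning the unified memory manager for a cache-heavy job.
// spark.memory.fraction        - share of (heap - reserved) given to Spark Memory (default 0.75 in 1.6)
// spark.memory.storageFraction - share of Spark Memory protected from eviction by execution (default 0.5)
SparkConf conf = new SparkConf()
        .setAppName("csv-cache-benchmark") // example name
        .set("spark.memory.fraction", "0.75")
        .set("spark.memory.storageFraction", "0.9"); // storage can still borrow unused execution
                                                     // memory, so 1.0 is rarely necessary
```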
The most important question to me: what is the User Memory actually for, especially for the use case I described above?
Is there a difference in memory usage if I change the program to use my own classes, e.g. RDD<MyOwnRepresentationClass> instead of RDD<String>?
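For context, this is roughly how such a custom class would be registered with Kryo (a sketch; MyOwnRepresentationClass is the hypothetical class from the question). Unregistered classes still work, but Kryo then stores the full class name with every serialized object, which costs extra space when persisting with MEMORY_ONLY_SER:

```java
import org.apache.spark.SparkConf;

// Sketch: registering the custom row class so Kryo serializes it compactly.
SparkConf conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(new Class<?>[]{ MyOwnRepresentationClass.class });
```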
Here is my code snippet (called many times from a Livy Client in a benchmark application). I'm using Spark 1.6.2 with Kryo serialization.
JavaRDD<String> inputRDD = sc.textFile(inputFile);

// Filter out invalid values
JavaRDD<String> cachedRDD = inputRDD.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String row) throws Exception {
        String[] parts = row.split(";");
        boolean hasFailure = false;
        // Some filtering stuff that sets hasFailure
        return hasFailure;
    }
}).persist(StorageLevel.MEMORY_ONLY_SER());
User Memory = Usable Memory * (1.0 - spark.memory.fraction) = 4820 MB * (1.0 - 0.6) = 4820 MB * 0.4 = 1928 MB.
Spark Memory = Usable Memory * spark.memory.fraction = 4820 MB * 0.6 = 2892 MB.
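The split above can be checked with a few lines of plain Java (no Spark needed; the 4820 MB of usable memory and spark.memory.fraction = 0.6 are taken from the example):

```java
public class MemorySplit {
    public static void main(String[] args) {
        double usableMb = 4820.0;       // usable memory from the example
        double memoryFraction = 0.6;    // spark.memory.fraction in the example

        double sparkMemoryMb = usableMb * memoryFraction;         // Spark Memory
        double userMemoryMb  = usableMb * (1.0 - memoryFraction); // User Memory

        System.out.println("Spark Memory = " + Math.round(sparkMemoryMb) + " MB");
        System.out.println("User Memory = " + Math.round(userMemoryMb) + " MB");
    }
}
```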
Determine the memory resources available to the Spark application: multiply the cluster RAM size by the YARN utilization percentage. This provides, for example, 5 GB RAM available for drivers and 50 GB RAM available for worker nodes. Discount 1 core per worker node to determine the executor core instances.
In-memory cluster computation enables Spark to run iterative algorithms, as programs can checkpoint data and refer back to it without reloading it from disk; in addition, it supports interactive querying and streaming data analysis at extremely fast speeds.
Unified Memory Manager:
1) ON HEAP: Objects are allocated on the JVM heap and are bound by GC.
2) OFF HEAP: Objects are allocated in memory outside the JVM via serialization, are managed by the application, and are not bound by GC. This memory management method avoids frequent GC, but the disadvantage is that you have to write the logic for memory allocation and memory release yourself.
ON HEAP:
Storage Memory: It's mainly used to store Spark cache data, such as RDD cache, Broadcast variable, Unroll data, and so on.
Execution Memory/shuffle memory: It's mainly used to store temporary data in the calculation process of Shuffle, Join, Sort, Aggregation, etc.
User Memory: It's mainly used to store the data needed for RDD conversion operations, such as the information for RDD dependency.
Reserved Memory: The memory is reserved for system and is used to store Spark's internal objects.
OFF HEAP MEMORY: 1) Storage Memory (shuffle memory) 2) Execution Memory
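For completeness, off-heap memory is opt-in and disabled by default; a sketch of enabling it (the two property names are real Spark settings, the 2 GB size is only an example value):

```java
import org.apache.spark.SparkConf;

// Sketch: enabling off-heap storage/execution memory.
// spark.memory.offHeap.size must be a positive value when off-heap is enabled.
SparkConf conf = new SparkConf()
        .set("spark.memory.offHeap.enabled", "true")
        .set("spark.memory.offHeap.size", String.valueOf(2L * 1024 * 1024 * 1024)); // 2 GB, example
```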