
Apache Spark: User Memory vs Spark Memory

I'm building a Spark application where I have to cache about 15 GB of CSV files. I read about the new UnifiedMemoryManager introduced in Spark 1.6 here:

https://0x0fff.com/spark-memory-management/

The article also includes a diagram of the memory layout.

The author distinguishes between User Memory and Spark Memory (which is itself split into Storage and Execution Memory). As I understand it, Spark Memory is flexible between execution (shuffle, sort, etc.) and storage (caching): if one side needs more memory, it can borrow from the other (as long as that part is not already completely used). Is this assumption correct?

The User Memory is described like this:

User Memory. This is the memory pool that remains after the allocation of Spark Memory, and it is completely up to you to use it in a way you like. You can store your own data structures there that would be used in RDD transformations. For example, you can rewrite Spark aggregation by using the mapPartitions transformation maintaining a hash table for this aggregation to run, which would consume so-called User Memory. [...] And again, this is the User Memory and it's completely up to you what would be stored in this RAM and how, Spark makes completely no accounting on what you do there and whether you respect this boundary or not. Not respecting this boundary in your code might cause an OOM error.
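To make the quoted example concrete, here is a minimal sketch (my own illustration, not taken from the article) of such a mapPartitions aggregation, assuming a JavaRDD<String> inputRDD like the one in my snippet further down. The per-partition HashMap is exactly the kind of structure that ends up in User Memory:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFlatMapFunction;

import scala.Tuple2;

// Hand-rolled aggregation via mapPartitions: Spark does no accounting for
// localCounts, so many distinct keys per partition can exhaust User Memory.
JavaPairRDD<String, Integer> counts = inputRDD.mapPartitionsToPair(
    new PairFlatMapFunction<Iterator<String>, String, Integer>() {
        @Override
        public Iterable<Tuple2<String, Integer>> call(Iterator<String> rows) throws Exception {
            Map<String, Integer> localCounts = new HashMap<String, Integer>(); // lives in User Memory
            while (rows.hasNext()) {
                String key = rows.next().split(";")[0]; // group by first CSV column
                Integer old = localCounts.get(key);
                localCounts.put(key, old == null ? 1 : old + 1);
            }
            List<Tuple2<String, Integer>> out = new ArrayList<Tuple2<String, Integer>>();
            for (Map.Entry<String, Integer> e : localCounts.entrySet()) {
                out.add(new Tuple2<String, Integer>(e.getKey(), e.getValue()));
            }
            return out;
        }
    });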

How can I access this part of the memory, and how is it managed by Spark?

And for my purpose, do I just need enough Storage Memory (since I don't do things like shuffle, join, etc.)? If so, can I set the spark.memory.storageFraction property to 1.0?
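For reference, a minimal configuration sketch (mine, not from the article) showing where these knobs live; the values shown are the Spark 1.6 defaults. Note that spark.memory.storageFraction only marks the part of Spark Memory that is immune to eviction by execution; it is not a hard cap on caching, and setting it to 1.0 would leave execution nothing to borrow:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
    .setAppName("CsvCacheBenchmark") // hypothetical name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.memory.fraction", "0.75")        // Spark Memory share of usable heap (Spark 1.6 default)
    .set("spark.memory.storageFraction", "0.5"); // eviction-proof share for storage (Spark 1.6 default)
JavaSparkContext sc = new JavaSparkContext(conf);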

The most important question to me is: what about the User Memory? What is it for, especially for the use case I described above?

Is there a difference in memory usage if I change the program to use my own classes, e.g. RDD<MyOwnRepresentationClass> instead of RDD<String>?
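(For illustration only: with Kryo it is usually worth registering such a class up front; MyOwnRepresentationClass is just the placeholder name from above.)

import org.apache.spark.SparkConf;

// Registering custom classes lets Kryo write a small class ID instead of the
// full class name with every serialized object.
SparkConf conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(new Class<?>[]{ MyOwnRepresentationClass.class });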

Here is my code snippet (it is called many times from the Livy Client in a benchmark application). I'm using Spark 1.6.2 with Kryo serialization.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.storage.StorageLevel;

// Read the CSV files as lines
JavaRDD<String> inputRDD = sc.textFile(inputFile);

// Filter out invalid values, then cache the result in serialized form
JavaRDD<String> cachedRDD = inputRDD.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String row) throws Exception {
        String[] parts = row.split(";");

        // Some filtering stuff (elided)
        boolean hasFailure = false; // placeholder for the real check on `parts`

        return hasFailure;
    }
}).persist(StorageLevel.MEMORY_ONLY_SER());
asked May 03 '17 by D. Müller

People also ask

What is user memory in Spark?

User Memory = Usable Memory * (1.0 - spark.memory.fraction) = 4820 MB * (1.0 - 0.6) = 4820 MB * 0.4 = 1928 MB. Spark Memory = Usable Memory * spark.memory.fraction.
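(The 4820 MB of "Usable Memory" here is the executor heap minus the fixed 300 MB of Reserved Memory: 5120 MB - 300 MB. A back-of-envelope sketch of the same arithmetic, assuming those numbers:)

public class MemoryPools {
    public static void main(String[] args) {
        long executorMb = 5120;                  // assumed executor heap
        long reservedMb = 300;                   // fixed Reserved Memory
        long usableMb = executorMb - reservedMb; // 4820 MB
        double fraction = 0.6;                   // spark.memory.fraction in this example
        long sparkMemoryMb = (long) (usableMb * fraction);        // 2892 MB
        long userMemoryMb = (long) (usableMb * (1.0 - fraction)); // 1928 MB
        System.out.println("Spark Memory: " + sparkMemoryMb
            + " MB, User Memory: " + userMemoryMb + " MB");
    }
}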

How is Spark memory calculated?

Determine the memory resources available for the Spark application: multiply the cluster RAM size by the YARN utilization percentage. This provides, for example, 5 GB RAM for the driver and 50 GB RAM for the worker nodes. Discount 1 core per worker node to determine the executor core count.

What is in-memory computation in Apache Spark?

In-memory cluster computation enables Spark to run iterative algorithms, as programs can checkpoint data and refer back to it without reloading it from disk; in addition, it supports interactive querying and streaming data analysis at extremely fast speeds.


1 Answer

Unified memory manager

1) ON HEAP: Objects are allocated on the JVM heap and are bound by GC.

2) OFF HEAP: Objects are allocated in memory outside the JVM via serialization, managed by the application, and are not bound by GC. This memory management method avoids frequent GC, but the disadvantage is that you have to write the memory allocation and release logic yourself.

ON HEAP:

Storage Memory: It's mainly used to store Spark cache data, such as RDD cache, broadcast variables, unroll data, and so on.

Execution Memory/shuffle memory: It's mainly used to store temporary data in the calculation process of Shuffle, Join, Sort, Aggregation, etc.

User Memory: It's mainly used to store the data needed for RDD transformations, such as RDD dependency information.

Reserved Memory: Memory reserved for the system, used to store Spark's internal objects.

OFF HEAP MEMORY: 1) Storage Memory 2) Execution Memory (shuffle memory). A sketch of enabling it follows below.
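A minimal sketch of turning the off-heap pools on (both settings exist since Spark 1.6; off-heap is disabled by default, and the size shown is purely illustrative):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .set("spark.memory.offHeap.enabled", "true") // disabled by default
    .set("spark.memory.offHeap.size", "2g");     // must be set when enabled; illustrative value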

answered Oct 29 '22 by dalwinder singh