
Apache Spark: User Memory vs Spark Memory

I'm building a Spark application where I have to cache about 15 GB of CSV files. I read about the new UnifiedMemoryManager introduced in Spark 1.6 here:

https://0x0fff.com/spark-memory-management/

The article also includes a diagram of the memory layout.

The author distinguishes between User Memory and Spark Memory (which is itself split into Storage and Execution Memory). As I understand it, Spark Memory is flexible between execution (shuffle, sort, etc.) and storage (caching): if one side needs more memory, it can borrow from the other (as long as that part is not already completely used). Is this assumption correct?

The User Memory is described like this:

User Memory. This is the memory pool that remains after the allocation of Spark Memory, and it is completely up to you to use it in a way you like. You can store your own data structures there that would be used in RDD transformations. For example, you can rewrite Spark aggregation by using the mapPartitions transformation maintaining a hash table for this aggregation to run, which would consume so-called User Memory. [...] And again, this is the User Memory and it's completely up to you what would be stored in this RAM and how, Spark makes completely no accounting on what you do there and whether you respect this boundary or not. Not respecting this boundary in your code might cause an OOM error.
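To make the quoted example concrete, here is a minimal sketch (my own illustration, not taken from the article) of such a mapPartitions aggregation, assuming a JavaRDD<String> inputRDD like the one in my snippet further down. The per-partition HashMap is exactly the kind of structure that ends up in User Memory:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFlatMapFunction;

import scala.Tuple2;

// Hand-rolled aggregation via mapPartitions: Spark does no accounting for
// localCounts, so many distinct keys per partition can exhaust User Memory.
JavaPairRDD<String, Integer> counts = inputRDD.mapPartitionsToPair(
    new PairFlatMapFunction<Iterator<String>, String, Integer>() {
        @Override
        public Iterable<Tuple2<String, Integer>> call(Iterator<String> rows) throws Exception {
            Map<String, Integer> localCounts = new HashMap<String, Integer>(); // lives in User Memory
            while (rows.hasNext()) {
                String key = rows.next().split(";")[0]; // group by first CSV column
                Integer old = localCounts.get(key);
                localCounts.put(key, old == null ? 1 : old + 1);
            }
            List<Tuple2<String, Integer>> out = new ArrayList<Tuple2<String, Integer>>();
            for (Map.Entry<String, Integer> e : localCounts.entrySet()) {
                out.add(new Tuple2<String, Integer>(e.getKey(), e.getValue()));
            }
            return out;
        }
    });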

How can I access this part of the memory, and how is it managed by Spark?

And for my purpose, do I just need enough Storage Memory (since I don't do things like shuffle, join, etc.)? If so, can I set the spark.memory.storageFraction property to 1.0?
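For reference, a minimal configuration sketch (mine, not from the article) showing where these knobs live; the values shown are the Spark 1.6 defaults. Note that spark.memory.storageFraction only marks the part of Spark Memory that is immune to eviction by execution; it is not a hard cap on caching, and setting it to 1.0 would leave execution nothing to borrow:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
    .setAppName("CsvCacheBenchmark") // hypothetical name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.memory.fraction", "0.75")        // Spark Memory share of usable heap (Spark 1.6 default)
    .set("spark.memory.storageFraction", "0.5"); // eviction-proof share for storage (Spark 1.6 default)
JavaSparkContext sc = new JavaSparkContext(conf);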

The most important question to me is: what about the User Memory? What is it for, especially for the use case I described above?

Is there a difference in memory usage if I change the program to use my own classes, e.g. RDD<MyOwnRepresentationClass> instead of RDD<String>?
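(For illustration only: with Kryo it is usually worth registering such a class up front; MyOwnRepresentationClass is just the placeholder name from above.)

import org.apache.spark.SparkConf;

// Registering custom classes lets Kryo write a small class ID instead of the
// full class name with every serialized object.
SparkConf conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(new Class<?>[]{ MyOwnRepresentationClass.class });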

Here is my code snippet (it is called many times from the Livy Client in a benchmark application). I'm using Spark 1.6.2 with Kryo serialization.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.storage.StorageLevel;

// Read the CSV files as lines
JavaRDD<String> inputRDD = sc.textFile(inputFile);

// Filter out invalid values, then cache the result in serialized form
JavaRDD<String> cachedRDD = inputRDD.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String row) throws Exception {
        String[] parts = row.split(";");

        // Some filtering stuff (elided)
        boolean hasFailure = false; // placeholder for the real check on `parts`

        return hasFailure;
    }
}).persist(StorageLevel.MEMORY_ONLY_SER());
asked May 03 '17 by D. Müller

People also ask

What is user memory in Spark?

User Memory = Usable Memory * (1.0 - spark.memory.fraction) = 4820 MB * (1.0 - 0.6) = 4820 MB * 0.4 = 1928 MB. Spark Memory = Usable Memory * spark.memory.fraction.
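(The 4820 MB of "Usable Memory" here is the executor heap minus the fixed 300 MB of Reserved Memory: 5120 MB - 300 MB. A back-of-envelope sketch of the same arithmetic, assuming those numbers:)

public class MemoryPools {
    public static void main(String[] args) {
        long executorMb = 5120;                  // assumed executor heap
        long reservedMb = 300;                   // fixed Reserved Memory
        long usableMb = executorMb - reservedMb; // 4820 MB
        double fraction = 0.6;                   // spark.memory.fraction in this example
        long sparkMemoryMb = (long) (usableMb * fraction);        // 2892 MB
        long userMemoryMb = (long) (usableMb * (1.0 - fraction)); // 1928 MB
        System.out.println("Spark Memory: " + sparkMemoryMb
            + " MB, User Memory: " + userMemoryMb + " MB");
    }
}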

How is Spark memory calculated?

Determine the memory resources available for the Spark application: multiply the cluster RAM size by the YARN utilization percentage. This provides, for example, 5 GB RAM for the driver and 50 GB RAM for the worker nodes. Discount 1 core per worker node to determine the executor core count.

What is in-memory computation in Apache Spark?

In-memory cluster computation enables Spark to run iterative algorithms, as programs can checkpoint data and refer back to it without reloading it from disk; in addition, it supports interactive querying and streaming data analysis at extremely fast speeds.


1 Answer

Unified memory manager

1) ON HEAP: Objects are allocated on the JVM heap and are bound by GC.

2) OFF HEAP: Objects are allocated in memory outside the JVM via serialization, managed by the application, and are not bound by GC. This memory management method avoids frequent GC, but the disadvantage is that you have to write the memory allocation and release logic yourself.

ON HEAP:

Storage Memory: It's mainly used to store Spark cache data, such as RDD cache, broadcast variables, unroll data, and so on.

Execution Memory/shuffle memory: It's mainly used to store temporary data in the calculation process of Shuffle, Join, Sort, Aggregation, etc.

User Memory: It's mainly used to store the data needed for RDD transformations, such as RDD dependency information.

Reserved Memory: Memory reserved for the system, used to store Spark's internal objects.

OFF HEAP MEMORY: 1) Storage Memory 2) Execution Memory (shuffle memory). A sketch of enabling it follows below.
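A minimal sketch of turning the off-heap pools on (both settings exist since Spark 1.6; off-heap is disabled by default, and the size shown is purely illustrative):

import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
    .set("spark.memory.offHeap.enabled", "true") // disabled by default
    .set("spark.memory.offHeap.size", "2g");     // must be set when enabled; illustrative value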

answered Oct 29 '22 by dalwinder singh