Spark cache only keeps a fraction of an RDD

When I explicitly call rdd.cache(), the Storage tab of the Spark console shows that only a fraction of the RDD is actually cached. Where are the remaining parts? How does Spark decide which parts to keep in the cache?

The same question applies to the initial raw data read in by sc.textFile(). I understand these RDDs are cached automatically, even though the Storage tab of the Spark console does not display any information on their cache status. Is there a way to know how much of that data is cached and how much is missing?

325

asked Apr 07 '15 22:04

bhomass

1 Answer

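cache() is the same as persist(StorageLevel.MEMORY_ONLY), and your data probably exceeds the memory available for caching. With MEMORY_ONLY, partitions that do not fit in memory are simply not stored; they are recomputed each time they are needed. Spark also evicts cached partitions in a "least recently used" (LRU) manner when it needs room for new ones.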
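To make the "only a fraction is cached" effect concrete, here is a toy model in plain Python. It is not Spark's implementation: the class ToyMemoryStore, the block sizes, and the capacity are all made up for illustration. It also assumes, as far as I understand Spark's behaviour, that blocks of an RDD are not evicted to make room for another partition of the same RDD, so a partition that does not fit is just skipped.

```python
from collections import OrderedDict


class ToyMemoryStore:
    """Toy model of a MEMORY_ONLY block store (hypothetical, not Spark code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        # (rdd_id, partition) -> size, ordered from least to most recently used
        self.blocks = OrderedDict()

    def used(self):
        return sum(self.blocks.values())

    def get(self, rdd_id, partition):
        key = (rdd_id, partition)
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            return True
        return False  # cache miss: the partition would have to be recomputed

    def put(self, rdd_id, partition, size):
        key = (rdd_id, partition)
        if key in self.blocks:
            return True  # already cached
        needed = size - (self.capacity - self.used())
        victims, freed = [], 0
        # Walk blocks from least recently used; only other RDDs may be evicted
        # (assumption stated in the text above).
        for other_key, other_size in self.blocks.items():
            if freed >= needed:
                break
            if other_key[0] != rdd_id:
                victims.append(other_key)
                freed += other_size
        if freed < needed:
            return False  # does not fit: this partition is not cached
        for victim in victims:
            del self.blocks[victim]
        self.blocks[key] = size
        return True


# Five partitions of 30 units each against a budget of 100 units.
store = ToyMemoryStore(capacity=100)
stored = [store.put("rdd_1", p, 30) for p in range(5)]
print(stored)                     # [True, True, True, False, False]
print(sum(stored) / len(stored))  # 0.6, i.e. 60% of the RDD is cached

# Caching a second RDD evicts the least recently used block of rdd_1.
evicted_for_rdd_2 = store.put("rdd_2", 0, 30)
```

In this model the first three partitions fill the budget and the last two are never stored, which matches what the Storage tab reports as a partially cached RDD. If recomputing the missing partitions is too expensive, persisting with a storage level that spills to disk, such as StorageLevel.MEMORY_AND_DISK, is the usual alternative.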

You can tweak the memory reserved for caching by setting configuration options. See the Spark documentation for details and look for spark.driver.memory, spark.executor.memory, and spark.storage.memoryFraction.
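As a rough sketch, the options above could go into conf/spark-defaults.conf like this. The values are placeholders chosen for illustration, not recommendations. Note that spark.storage.memoryFraction belongs to the legacy memory manager; from Spark 1.6 on, unified memory management uses spark.memory.fraction and spark.memory.storageFraction instead.

```
# conf/spark-defaults.conf (illustrative values only)
spark.driver.memory            2g
spark.executor.memory          8g

# Legacy setting (before Spark 1.6): share of the executor heap used for caching
spark.storage.memoryFraction   0.6

# Spark 1.6 and later: unified memory management
spark.memory.fraction          0.6
spark.memory.storageFraction   0.5
```

The same properties can also be passed per job with --conf key=value on spark-submit.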

I am not an expert, but I do not think that textFile() caches anything automatically; the Spark Quick Start explicitly caches a text file RDD with sc.textFile(logFile, 2).cache().

147

answered Oct 21 '22 10:10

stholzm

Tags: caching, swap, apache-spark
