Efficient way to join a cached spark dataframe with other and cache again

I have a large dataframe that has been cached, like:

val largeDf = someLargeDataframe.cache

Now I need to union it with a tiny one and cache it again:

val tinyDf = someTinyDataframe.cache
val newDataframe = largeDf.union(tinyDf).cache
tinyDf.unpersist()
largeDf.unpersist()

This is very inefficient since it needs to re-cache all the data again. Is there an efficient way to add a small amount of data to a large cached dataframe?


After reading Teodors's explanation, I know that I can't unpersist the old dataframe before I do some action on my new dataframe. But what if I need to do something like this?

def myProcess(df1: DataFrame, df2: DataFrame): DataFrame = {
    val df1_trans = df1.map(....).cache
    val df2_trans = df2.map(....).cache

    doSomeAction(df1_trans, df2_trans)

    val finalDf = df1_trans.union(df2_trans).map(....).cache
    // df1_trans.unpersist()
    // df2_trans.unpersist()
    finalDf
}

I want df1_trans and df2_trans to be cached to improve performance inside the function, since each of them is used more than once. But the dataframe I return at the end is also built from df1_trans and df2_trans: if I can't unpersist them before leaving the function, there is no other place to do it, and if I do unpersist them here, my finalDf won't benefit from their caches.

What can I do in this situation? Thanks!
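For reference, the only workaround I can see is to force an action on finalDf before unpersisting its parents; a rough sketch (select("*") stands in for my real map(...) logic, and the extra count() is only there to materialize the cache):

import org.apache.spark.sql.DataFrame

def myProcessEager(df1: DataFrame, df2: DataFrame): DataFrame = {
    val df1_trans = df1.select("*").cache
    val df2_trans = df2.select("*").cache

    // doSomeAction(df1_trans, df2_trans) runs here, reusing both caches

    val finalDf = df1_trans.union(df2_trans).cache
    finalDf.count()          // extra action: fills finalDf's cache from the parent caches
    df1_trans.unpersist()    // safe now: finalDf no longer needs to recompute its parents
    df2_trans.unpersist()
    finalDf
}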

asked May 24 '17 by 林鼎棋


People also ask

How do you optimize a join in Spark?

Step 1, shuffling: the data from the joined tables is partitioned by the join key, shuffling rows across partitions so that records with the same join key end up in the same partition. Step 2, hash join: a classic single-node hash-join algorithm is then performed on the data in each partition.
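A quick way to see those two steps is to print the physical plan of a join; a sketch with made-up ranges (not taken from the question):

val left  = spark.range(0, 1000000).toDF("id")
val right = spark.range(0, 1000000).toDF("id")

// Look for Exchange hashpartitioning(id, ...) under the join node: both sides are
// repartitioned by the join key before the per-partition join runs. If one side is
// small enough, Spark broadcasts it instead and the big side is not shuffled.
left.join(right, "id").explain()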

How do I join two data frames in Spark?

PySpark's join is used to combine two DataFrames, and by chaining joins you can combine multiple DataFrames; it supports all the basic join types available in traditional SQL, such as INNER, LEFT OUTER, RIGHT OUTER, LEFT ANTI, LEFT SEMI, CROSS and SELF JOIN.
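The Scala API works the same way; a small sketch with made-up data showing chained joins and explicit join types:

import spark.implicits._   // spark is an existing SparkSession

val users  = Seq((1, "ann"), (2, "bob"), (3, "cho")).toDF("userId", "name")
val orders = Seq((1, 9.99), (1, 5.00)).toDF("userId", "amount")
val tags   = Seq((2, "vip")).toDF("userId", "tag")

val result = users
  .join(orders, Seq("userId"), "left_outer")  // keep users even if they have no orders
  .join(tags,   Seq("userId"), "left_anti")   // then keep only rows with no matching tag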

Which of the following are the correct ways to cache data tables in Spark SQL?

You should use sqlContext.cacheTable("table_name") to cache it, or alternatively the CACHE TABLE table_name SQL query.
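A short sketch of both variants (the table name is made up; spark.catalog.cacheTable is the Spark 2.x equivalent of sqlContext.cacheTable):

someLargeDataframe.createOrReplaceTempView("my_table")

spark.catalog.cacheTable("my_table")   // API route
// or
spark.sql("CACHE TABLE my_table")      // SQL route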

What is powerful caching in Spark?

Caching methods in Spark:
- DISK_ONLY: persist data on disk only, in serialized format.
- MEMORY_ONLY: persist data in memory only, in deserialized format.
- MEMORY_AND_DISK: persist data in memory; if not enough memory is available, evicted blocks are stored on disk.
- OFF_HEAP: data is persisted in off-heap memory.
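In code these map onto persist(StorageLevel.<LEVEL>); cache() on a DataFrame is shorthand for the MEMORY_AND_DISK default. A sketch:

import org.apache.spark.storage.StorageLevel

someLargeDataframe.persist(StorageLevel.DISK_ONLY)        // serialized, on disk only
someLargeDataframe.unpersist()
someLargeDataframe.persist(StorageLevel.MEMORY_ONLY)      // deserialized, in memory only
someLargeDataframe.unpersist()
someLargeDataframe.persist(StorageLevel.MEMORY_AND_DISK)  // evicted blocks spill to disk
someLargeDataframe.unpersist()
someLargeDataframe.persist(StorageLevel.OFF_HEAP)         // needs off-heap memory enabled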


2 Answers

val largeDf = someLargeDataframe.cache
val tinyDf = someTinyDataframe.cache
val newDataframe = largeDf.union(tinyDf).cache

If you call unpersist() now, before any action has gone through your whole largeDf dataframe, you won't benefit from caching the two dataframes.

tinyDf.unpersist()
largeDf.unpersist()

I wouldn't worry about caching the unioned dataframe: as long as the two other dataframes are already cached, you likely won't see a performance hit.

Benchmark the following:

========= now (your current code) ============
val largeDf = someLargeDataframe.cache
val tinyDf = someTinyDataframe.cache
val newDataframe = largeDf.union(tinyDf).cache
tinyDf.unpersist()
largeDf.unpersist()
// force evaluation
newDataframe.count()

========= alternative 1 ============
val largeDf = someLargeDataframe.cache
val tinyDf = someTinyDataframe.cache
val newDataframe = largeDf.union(tinyDf).cache

// force evaluation
newDataframe.count()
tinyDf.unpersist()
largeDf.unpersist()

======== alternative 2 ==============
val largeDf = someLargeDataframe.cache
val tinyDf = someTinyDataframe.cache
val newDataframe = largeDf.union(tinyDf)

newDataframe.count()


======== alternative 3 ==============
val largeDf = someLargeDataframe
val tinyDf = someTinyDataframe
val newDataframe = largeDf.union(tinyDf).cache

// force evaluation
newDataframe.count()
answered Sep 27 '22 by Boggio


Is there an efficient way to add a small amount of data to a large cached dataframe?

I don't think any other operation could beat union. I did think that broadcast function might help here, but after having a look at the execution plan I don't think so anymore.

That led me to write this answer: if you want to know whether caching has any effect on a query, explain it:

explain(): Unit
Prints the physical plan to the console for debugging purposes.

In the following example, broadcast does not affect union (which is not surprising, given it's a hint for joins that other physical operators simply ignore).

scala> left.union(broadcast(right)).explain
== Physical Plan ==
Union
:- *Range (0, 4, step=1, splits=8)
+- *Range (0, 3, step=1, splits=8)

It's also worthwhile to use the Details for Query page under the SQL tab of the Spark web UI.

(screenshot: the Details for Query page under the SQL tab in the Spark web UI)
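Besides the plan and the web UI, a couple of programmatic checks can be used (a sketch, assuming Spark 2.1+; the table name is hypothetical):

println(newDataframe.storageLevel)     // StorageLevel.NONE means it is not cached

spark.catalog.isCached("my_table")     // for tables/views registered in the catalog

newDataframe.explain()                 // a cached plan shows InMemoryRelation / InMemoryTableScan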

answered Sep 27 '22 by Jacek Laskowski