Object cache on Spark executors

Tags:

scala

apache-spark

A good question for Spark experts.

I am processing data in a map operation (RDD). Within the mapper function, I need to look up objects of class A that are used to process the elements of the RDD.

Since this will be performed on executors AND creation of elements of type A (that will be looked up) happens to be an expensive operation, I want to pre-load and cache these objects on each executor. What is the best way of doing it?

One idea is to broadcast a lookup table, but class A is not serializable (no control over its implementation).
Another idea is to load them up in a singleton object. However, I want to control what gets loaded into that lookup table (e.g. possibly different data on different Spark jobs).

Ideally, I want to specify what will be loaded on executors once (including the case of Streaming, so that the lookup table stays in memory between batches), through a parameter that will be available on the driver during its start-up, before any data gets processed.

Is there a clean and elegant way of doing it or is it impossible to achieve?


asked Nov 05 '16 07:11

DruckerBg

2 Answers

This is exactly the targeted use case for broadcast. Broadcast variables are transmitted once, use a BitTorrent-like protocol to move efficiently to all executors, and stay in memory / on local disk until you no longer need them.

Serialization often pops up as an issue when using others' interfaces. If you can enforce that the objects you consume are serializable, that's going to be the best solution. If this is impossible, your life gets a little more complicated. If you can't serialize the A objects, then you have to create them on the executors for each task. If they're stored in a file somewhere, this would look something like:

rdd.mapPartitions { it => 
  val lookupTable = loadLookupTable(path)
  it.map(elem => fn(lookupTable, elem))
}

Note that if you're using this model, then you have to load the lookup table once per task -- you can't benefit from the cross-task persistence of broadcast variables.
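To make the once-per-task cost concrete, here is a minimal plain-Scala sketch with no Spark involved. The `PerPartitionSetup` object, its `loadCount` counter, and the toy `Map[Int, String]` table are hypothetical stand-ins rather than part of the original answer; nested `Seq`s play the role of partitions.

```scala
import java.util.concurrent.atomic.AtomicInteger

object PerPartitionSetup {
  // Counts how many times the (pretend) expensive load runs.
  val loadCount = new AtomicInteger(0)

  // Hypothetical loader standing in for loadLookupTable(path).
  def loadLookupTable(path: String): Map[Int, String] = {
    loadCount.incrementAndGet()
    Map(1 -> "one", 2 -> "two", 3 -> "three")
  }

  // Same shape as the function passed to mapPartitions:
  // set up once per partition, then map the elements lazily.
  def processPartition(path: String)(it: Iterator[Int]): Iterator[String] = {
    val lookupTable = loadLookupTable(path)
    it.map(elem => lookupTable.getOrElse(elem, "unknown"))
  }

  // Plain-Scala stand-in for an RDD made of the given partitions.
  def run(partitions: Seq[Seq[Int]], path: String): Seq[String] =
    partitions.flatMap(p => processPartition(path)(p.iterator).toList)
}
```

With two simulated partitions the loader runs twice, regardless of how many elements each partition holds; on a real cluster the same count applies per task.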

EDIT: Here's another model, which I believe lets you share the lookup table across tasks per JVM.

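class BroadcastableLookupTable(path: String) extends Serializable {
  // Excluded from serialization, so it is null again on each executor JVM.
  @transient private var lookupTable: LookupTable[A] = null

  // Loads the table on first use, then reuses it.
  def get: LookupTable[A] = synchronized {
    if (lookupTable == null)
      lookupTable = loadLookupTable(path) // load lookup table from disk
    lookupTable
  }
}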

This class can be broadcast (nothing substantial is transmitted), and the first time get is called in each JVM, it loads the lookup table and returns it; later calls in that JVM reuse the loaded table.
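A variation on the same lazy, per-JVM loading idea, offered only as a sketch: instead of broadcasting a wrapper, the tables can live in a Scala `object` (one instance per JVM, hence one per executor) keyed by a parameter chosen on the driver. That combines the singleton idea from the question with control over what gets loaded. The names `LookupTables`, `forPath`, `loadCount`, and the toy `A` class are hypothetical, and the loader body is a placeholder for the real, expensive construction.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the non-serializable class A from the question.
final class A(val label: String)

// A Scala object exists once per JVM, so each executor gets its own cache.
object LookupTables {
  // Counts expensive loads; only here to make the behaviour observable.
  val loadCount = new AtomicInteger(0)

  private val tables = new ConcurrentHashMap[String, Map[String, A]]()

  // Hypothetical loader: replace the body with the real construction of A objects.
  private def load(path: String): Map[String, A] = {
    loadCount.incrementAndGet()
    Map("example-key" -> new A(s"built from $path"))
  }

  // Returns the table for `path`, loading it at most once per JVM.
  def forPath(path: String): Map[String, A] =
    tables.computeIfAbsent(path, (p: String) => load(p))
}
```

Inside a job this would be used as `rdd.mapPartitions { it => val table = LookupTables.forPath(path); it.map(elem => fn(table, elem)) }`, so only the `path` string is captured by the closure and the `A` objects are never serialized. Because the cache lives as long as the executor JVM, it should also survive between streaming batches, assuming the executors are not restarted.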


answered Oct 12 '22 00:10

Tim

If serialisation turns out to be impossible, how about storing the lookup objects in a database? It's not the easiest solution, granted, but it should work just fine. I could recommend checking out e.g. spark-redis, but I am sure there are better solutions out there.

answered Oct 12 '22 02:10

Lukasz Tracewski
