I have a Weka model stored in S3, around 400 MB in size. I have a set of records on which I want to run the model and perform prediction. Here is what I have tried:
<ol>
<li>Download and load the model on the driver as a static object and broadcast it to all executors, then perform a map operation on the prediction RDD. Result: not working, because Weka needs to modify the model object while predicting, and a broadcast requires a read-only copy.</li>
<li>Download and load the model on the driver as a static object and send it to the executors in each map operation. Result: working, but not efficient, because I am passing a 400 MB object in each map operation.</li>
<li>Download the model on the driver, load it on each executor, and cache it there. (I don't know how to do that.)</li>
</ol>
Does anyone know how I can load the model on each executor once and cache it, so that I don't load it again for the other records?

You have two options:
<h3>1. Create a singleton object with a lazy val representing the data:</h3>
<pre class="prettyprint"><code>
object WekaModel {
  lazy val data = {
    // initialize data here. This will only happen once per JVM process
  }
}
</code></pre>
Then, you can use the lazy val in your <code>map</code> function. The <code>lazy val</code> ensures that each worker JVM initializes its own instance of the data. No serialization or broadcast is performed for <code>data</code>.
<pre class="prettyprint"><code>
elementsRDD.map { element =>
  // use WekaModel.data here
}
</code></pre>
Advantages:
<ul>
<li>More efficient, since the data is initialized once per JVM instance. This approach is a good choice when you need to initialize a database connection pool, for example.</li>
</ul>
Disadvantages:
<ul>
<li>Less control over initialization. For example, it's trickier to initialize your object if you require runtime parameters.</li>
<li>You can't really free up or release the object if you need to. Usually that's acceptable, since the OS will free up the resources when the process exits.</li>
</ul>
<h3>2. Use the <code>mapPartitions</code> (or <code>foreachPartition</code>) method on the RDD instead of just <code>map</code>.</h3>
This allows you to initialize whatever you need once for the entire partition.
<pre class="prettyprint"><code>
elementsRDD.mapPartitions { elements =>
  // WekaModel is assumed here to be a class, unlike the singleton object in option 1
  val model = new WekaModel()
  elements.map { element =>
    // use model and element. There is a single instance of model per partition.
  }
}
</code></pre>
Advantages:
<ul>
<li>Provides more flexibility in the initialization and deinitialization of objects.</li>
</ul>
Disadvantages:
<ul>
<li>Each partition creates and initializes a new instance of your object. Depending on how many partitions you have per JVM instance, this may or may not be an issue.</li>
</ul>
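The two options can also be combined: keep the model in a lazily initialized singleton and look it up once per partition. Below is a minimal sketch of that idea, not a definitive implementation. `FakeModel`, `ModelHolder`, `scorePartition`, and `Demo` are hypothetical names introduced for illustration, and plain Scala iterators stand in for partitions so the snippet runs without a Spark dependency. In a real job the loader body would download and deserialize the Weka model, and a function of this shape could be passed to Spark's `RDD.mapPartitions`.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for an expensive-to-load model (e.g. a deserialized Weka classifier).
final class FakeModel {
  def predict(x: Double): Double = x * 2.0
}

object ModelHolder {
  // Counts how many times the loader body runs in this JVM.
  val loadCount = new AtomicInteger(0)

  // Evaluated on first access only; later accesses reuse the same instance.
  lazy val model: FakeModel = {
    loadCount.incrementAndGet()
    new FakeModel // assumption: a real job would fetch and deserialize the model here
  }
}

object Demo {
  // Shape Iterator[T] => Iterator[U], the shape RDD.mapPartitions expects.
  def scorePartition(rows: Iterator[Double]): Iterator[Double] = {
    val m = ModelHolder.model // looked up once per partition, loaded once per JVM
    rows.map(x => m.predict(x))
  }

  def main(args: Array[String]): Unit = {
    // Two "partitions" processed in the same JVM.
    val out1 = scorePartition(Iterator(1.0, 2.0)).toList
    val out2 = scorePartition(Iterator(3.0)).toList
    println(out1)                      // List(2.0, 4.0)
    println(out2)                      // List(6.0)
    println(ModelHolder.loadCount.get) // 1: the loader ran only once
  }
}
```

With this layout only the small `scorePartition` closure is shipped to the executors, and each executor JVM pays the loading cost at most once, however many partitions it processes.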

How to perform one operation on each executor once in Spark

answered Oct 07 '22 03:10

Dia Kharrat

Tags:

scala

apache-spark

partitioning

weka

Neha
