Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to perform one operation on each executor once in spark

I have a weka model stored in S3 which is of size around 400MB. Now, I have some set of record on which I want to run the model and perform prediction.

For performing prediction, What I have tried is,

  1. Download and load the model on driver as a static object , broadcast it to all executors. Perform a map operation on prediction RDD. ----> Not working, as in Weka for performing prediction, model object needs to be modified and broadcast require a read-only copy.

  2. Download and load the model on driver as a static object and send it to executor in each map operation. -----> Working (Not efficient, as in each map operation, i am passing 400MB object)

  3. Download the model on driver and load it on each executor and cache it there. (Don't know how to do that)

Does someone have any idea how can I load the model on each executor once and cache it so that for other records I don't load it again?

like image 707
Neha Avatar asked Oct 13 '16 08:10

Neha


People also ask

How do I set the number of executors in a Spark?

Leave 1 core per node for Hadoop/Yarn daemons => Num cores available per node = 16-1 = 15. So, Total available of cores in cluster = 15 x 10 = 150. Number of available executors = (total cores/num-cores-per-executor) = 150/5 = 30. Leaving 1 executor for ApplicationManager => --num-executors = 29.

How does Executor work in Spark?

Executors are worker nodes' processes in charge of running individual tasks in a given Spark job. They are launched at the beginning of a Spark application and typically run for the entire lifetime of an application. Once they have run the task they send the results to the driver.

Where do Spark executors run?

Executor resides in the Worker node. Executors are launched at the start of a Spark Application in coordination with the Cluster Manager. They are dynamically launched and removed by the Driver as per required. To run an individual Task and return the result to the Driver.

How does spark driver and Executor work?

Executors in Spark are the worker nodes that help in running individual tasks by being in charge of a given spark job. These are launched at the beginning of Spark applications, and as soon as the task is run, results are immediately sent to the driver.


1 Answers

You have two options:

1. Create a singleton object with a lazy val representing the data:

    object WekaModel {         lazy val data = {             // initialize data here. This will only happen once per JVM process         }     }        

Then, you can use the lazy val in your map function. The lazy val ensures that each worker JVM initializes their own instance of the data. No serialization or broadcasts will be performed for data.

    elementsRDD.map { element =>         // use WekaModel.data here     } 

Advantages

  • is more efficient, as it allows you to initialize your data once per JVM instance. This approach is a good choice when needing to initialize a database connection pool for example.

Disadvantages

  • Less control over initialization. For example, it's trickier to initialize your object if you require runtime parameters.
  • You can't really free up or release the object if you need to. Usually, that's acceptable, since the OS will free up the resources when the process exits.

2. Use the mapPartition (or foreachPartition) method on the RDD instead of just map.

This allows you to initialize whatever you need for the entire partition.

    elementsRDD.mapPartition { elements =>         val model = new WekaModel()          elements.map { element =>             // use model and element. there is a single instance of model per partition.         }     } 

Advantages:

  • Provides more flexibility in the initialization and deinitialization of objects.

Disadvantages

  • Each partition will create and initialize a new instance of your object. Depending on how many partitions you have per JVM instance, it may or may not be an issue.
like image 119
Dia Kharrat Avatar answered Oct 07 '22 03:10

Dia Kharrat