I have a situation where I want to execute a system process on each worker within Spark. I want this process to run on each machine exactly once. Specifically, this process starts a daemon which needs to be running before the rest of my program executes. Ideally this should execute before I've read any data in.
I'm on Spark 2.0.2 and using dynamic allocation.
Yes, a worker node can host multiple executors (processes) if it has sufficient CPU, memory, and storage.
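For reference, executor sizing (and therefore how many executors fit on one worker) is driven by Spark configuration. A rough sketch, with purely illustrative values and app name:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative sizing: with 4 cores and 8g per executor, a worker with
// 16 cores and 32g of RAM could host up to four executors.
val conf = new SparkConf()
  .setAppName("executor-sizing-example")
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "8g")
  .set("spark.dynamicAllocation.enabled", "true") // as in the question
  .set("spark.shuffle.service.enabled", "true")   // required when dynamic allocation is on

val spark = SparkSession.builder().config(conf).getOrCreate()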
Go to the Apache Spark installation directory from the command line, type bin/spark-shell, and press Enter; this launches the Spark shell and gives you a Scala prompt to interact with Spark in Scala. If you have added Spark to your PATH, just enter spark-shell in the command line or terminal (Mac users).
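Once you are at the scala> prompt, spark-shell has already created a SparkContext (sc) and a SparkSession (spark) for you, so a quick sanity check might look like this (the commented results are just what you would typically see):

// At the scala> prompt; `sc` and `spark` are pre-created by spark-shell.
sc.version                    // e.g. "2.0.2"
sc.master                     // e.g. "local[*]" or your cluster's master URL
sc.parallelize(1 to 5).sum()  // 15.0, confirms the shell can run a simple job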
Workers (slaves) are running Spark instances where executors live to execute tasks; they are the compute nodes in Spark. A worker receives serialized tasks that it runs in a thread pool, and it hosts a local Block Manager that serves blocks to other workers in the Spark cluster.
You may be able to achieve this with a combination of a lazy val and a Spark broadcast variable. It would be something like below. (I have not compiled the code below, so you may have to change a few things.)
object ProcessManager extends Serializable {
  // Runs at most once per JVM (i.e. per executor) because it is a lazy val.
  lazy val start: Process =
    new ProcessBuilder("/path/to/your/daemon").start() // start your process here
}
You can broadcast this object at the start of your application before you do any transformations.
val pm = sc.broadcast(ProcessManager)
Now, you can access this object inside your transformations just as you would any other broadcast variable and invoke the lazy val.
rdd.mapPartitions { itr =>
  pm.value.start // referencing the lazy val starts the daemon (once per executor)
  // Other stuff here.
  itr
}
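Putting it together, here is a minimal end-to-end sketch of this approach; the daemon path and the object and app names are placeholders of mine, not from the original answer. The idea is to run a cheap job up front purely to force the lazy val on the currently allocated executors before any data is read:

import org.apache.spark.sql.SparkSession

object DaemonStarter extends Serializable {
  // The daemon is launched at most once per executor JVM, the first time
  // any task running in that JVM touches `start`.
  lazy val start: Process =
    new ProcessBuilder("/path/to/your/daemon").start() // placeholder path
}

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("daemon-per-executor").getOrCreate()
    val sc = spark.sparkContext

    val bc = sc.broadcast(DaemonStarter)

    // Cheap warm-up job: many small partitions so every currently allocated
    // executor gets at least one task and therefore starts the daemon.
    sc.parallelize(1 to 1000, 100).foreachPartition(_ => bc.value.start)

    // ... now read your data and run the real job ...
    spark.stop()
  }
}

One caveat with dynamic allocation: executors that join later only start the daemon the first time one of their tasks touches bc.value.start, so it is worth referencing it inside your real transformations as well.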