Different sources (e.g. 1 and 2) claim that Spark can benefit from running multiple tasks in the same JVM. But they don't explain why.
What are these benefits?
As already mentioned, broadcast variables are one benefit.
Another is the way closures and shared variables behave. Take a look at this piece of code:
val data = Array(1, 2, 3, 4, 5)   // sample input
var counter = 0
val rdd = sc.parallelize(data)
rdd.foreach(x => counter += x)    // each task increments its own copy of counter
println(counter)
The result may differ depending on whether the code runs locally or on a cluster (i.e. across different JVMs). In the latter case the parallelize method splits the computation between the executors. The closure (the environment every task needs to do its work) is computed and shipped out, which means that every executor receives its own copy of counter. Each executor updates only that copy, so the value printed on the driver is still 0: none of the executors ever touched the driver's object. Within one JVM, on the other hand, counter is visible to every worker.
Of course there is a way to avoid that: using Accumulators (see here).
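A minimal sketch of the Accumulator approach, assuming Spark 2.x and an existing SparkContext sc (the input data is made up for illustration):

val data = Seq(1, 2, 3, 4, 5)                // hypothetical input
val rdd = sc.parallelize(data)

val counter = sc.longAccumulator("counter")  // registered on the driver
rdd.foreach(x => counter.add(x))             // executors send their increments back to the driver

println(counter.value)                       // 15, regardless of deployment mode

Unlike the closure-captured variable above, the accumulator's updates are merged on the driver, so the final value is correct both locally and on a cluster.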
Last but not least, when persisting RDDs in memory (the default storage level of the cache method is MEMORY_ONLY), the cached data is visible only within a single JVM. This, too, can be overcome by using OFF_HEAP (still experimental in 2.4.0). More here.
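A rough sketch of off-heap caching, assuming off-heap memory has been enabled in the Spark configuration (e.g. spark.memory.offHeap.enabled=true and spark.memory.offHeap.size=1g; the values are placeholders):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.OFF_HEAP)   // store the blocks outside the JVM heap instead of MEMORY_ONLY
rdd.count()                          // the first action materializes the cached data

The data then lives outside any particular executor's heap, at the cost of serialization and the experimental status noted above.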
The biggest possible advantage is shared memory, in particular the handling of broadcast objects. Because these objects are treated as read-only, they can be shared between multiple threads.
In a scenario with a single task per executor you need a copy for each JVM, so with N tasks there are N copies. With large objects this can be serious overhead.
The same logic applies to other shared objects.
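A small sketch of a broadcast variable, assuming a SparkContext sc; the lookup table stands in for a large read-only object:

val lookup: Map[Int, String] = Map(1 -> "a", 2 -> "b")   // hypothetical large object
val bcLookup = sc.broadcast(lookup)                      // shipped once per executor JVM

val rdd = sc.parallelize(Seq(1, 2, 1, 3))
val resolved = rdd.map(id => bcLookup.value.getOrElse(id, "unknown"))
resolved.collect()

With many tasks running as threads in the same executor JVM, they all read the single deserialized copy behind bcLookup.value instead of each JVM holding its own.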