 

What are the benefits of running multiple Spark tasks in the same JVM?

Different sources (e.g. 1 and 2) claim that Spark can benefit from running multiple tasks in the same JVM. But they don't explain why.

What are these benefits?

Asked Dec 04 '17 by Marek Grzenkowicz


2 Answers

As already mentioned, broadcast variables are one benefit.

Another is how concurrency behaves. Take a look at this piece of code:

var counter = 0
val rdd = sc.parallelize(data)

// this closure captures `counter`
rdd.foreach(x => counter += x)

println(counter)

The result differs depending on whether this runs locally or on a cluster (across different JVMs). In the latter case Spark splits the foreach computation between executors. The closure (the environment each node needs to do its task) is serialized and shipped, which means every executor receives its own copy of counter. Each executor updates only its local copy, so the driver still prints 0, because no executor ever touches the driver's variable. Within one JVM, on the other hand, counter is visible to every worker.

Of course there is a way to avoid that - using Accumulators (see here).
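A minimal sketch of the accumulator approach (the app name and sample data are placeholders; a LongAccumulator is merged on the driver, so it gives the right total regardless of how many JVMs the tasks run in):

```scala
import org.apache.spark.sql.SparkSession

object AccumulatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("accumulator-example")
      .getOrCreate()
    val sc = spark.sparkContext

    val data = Seq(1, 2, 3, 4, 5)
    // Registered with the driver; executors send partial sums back to it
    val counter = sc.longAccumulator("counter")

    sc.parallelize(data).foreach(x => counter.add(x))

    // counter.value is read on the driver after the action completes
    println(counter.value)
    spark.stop()
  }
}
```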

Last but not least, when persisting RDDs in memory (the default cache storage level is MEMORY_ONLY), the cached data lives inside a single JVM's heap, so only tasks in that JVM can share it. This can be overcome by using OFF_HEAP (experimental as of 2.4.0). More here.
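For reference, a short sketch of choosing the storage level explicitly (assuming an existing SparkContext sc):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000)

// Equivalent to rdd.cache(): blocks kept on the executor JVM heap
rdd.persist(StorageLevel.MEMORY_ONLY)

// Alternative: store serialized blocks outside the JVM heap
// rdd.persist(StorageLevel.OFF_HEAP)

rdd.count() // an action materializes the cached blocks
```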

Answered Oct 27 '22 by Andronicus


The biggest possible advantage is shared memory, in particular the handling of broadcasted objects. Because these objects are considered read-only, they can be shared between multiple threads.

In a scenario where you run a single task per executor, you need a copy for each JVM, so with N tasks there are N copies. With large objects this can be a serious overhead.

Same logic can be applied to other shared objects.
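A small sketch of the broadcast pattern (the lookup table here is a made-up example): the broadcast value is materialized once per executor JVM, and every task thread in that JVM reads the same copy instead of deserializing its own.

```scala
// Assumes an existing SparkContext sc
val lookup: Map[Int, String] = Map(1 -> "one", 2 -> "two")

// One copy of `lookup` per executor JVM, shared by all its task threads
val bcastLookup = sc.broadcast(lookup)

val ids = sc.parallelize(Seq(1, 2, 1, 3))
val resolved = ids.map(id => bcastLookup.value.getOrElse(id, "unknown"))

println(resolved.collect().mkString(", "))
```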

Answered Oct 26 '22 by Alper t. Turker