Different sources (e.g. 1 and 2) claim that Spark can benefit from running multiple tasks in the same JVM. But they don't explain why.
What are these benefits?
As already mentioned, broadcast variables are one benefit.
Another is the way closures and shared variables behave. Take a look at this piece of code:
val data = Array(1, 2, 3, 4, 5)   // sample input
var counter = 0
val rdd = sc.parallelize(data)
rdd.foreach(x => counter += x)    // each task increments its own copy of counter
println(counter)
The result may differ depending on whether the code runs locally or on a cluster (i.e. across different JVMs). In the latter case the parallelize method splits the computation between the executors. The closure (the environment every task needs to do its work) is computed and shipped out, which means that every executor receives its own copy of counter. Each executor updates only that copy, so the value printed on the driver is still 0: none of the executors ever touched the driver's object. Within one JVM, on the other hand, counter is visible to every worker.
Of course there is a way to avoid that: using Accumulators (see here).
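A minimal sketch of the Accumulator approach, assuming Spark 2.x and an existing SparkContext sc (the input data is made up for illustration):

val data = Seq(1, 2, 3, 4, 5)                // hypothetical input
val rdd = sc.parallelize(data)

val counter = sc.longAccumulator("counter")  // registered on the driver
rdd.foreach(x => counter.add(x))             // executors send their increments back to the driver

println(counter.value)                       // 15, regardless of deployment mode

Unlike the closure-captured variable above, the accumulator's updates are merged on the driver, so the final value is correct both locally and on a cluster.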
Last but not least, when persisting RDDs in memory (the default storage level of the cache method is MEMORY_ONLY), the cached data is visible only within a single JVM. This, too, can be overcome by using OFF_HEAP (still experimental in 2.4.0). More here.
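A rough sketch of off-heap caching, assuming off-heap memory has been enabled in the Spark configuration (e.g. spark.memory.offHeap.enabled=true and spark.memory.offHeap.size=1g; the values are placeholders):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.OFF_HEAP)   // store the blocks outside the JVM heap instead of MEMORY_ONLY
rdd.count()                          // the first action materializes the cached data

The data then lives outside any particular executor's heap, at the cost of serialization and the experimental status noted above.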
The biggest possible advantage is shared memory, in particular the handling of broadcast objects. Because these objects are treated as read-only, they can be shared between multiple threads.
In a scenario with a single task per executor you need a copy for each JVM, so with N tasks there are N copies. With large objects this can be serious overhead.
The same logic applies to other shared objects.
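A small sketch of a broadcast variable, assuming a SparkContext sc; the lookup table stands in for a large read-only object:

val lookup: Map[Int, String] = Map(1 -> "a", 2 -> "b")   // hypothetical large object
val bcLookup = sc.broadcast(lookup)                      // shipped once per executor JVM

val rdd = sc.parallelize(Seq(1, 2, 1, 3))
val resolved = rdd.map(id => bcLookup.value.getOrElse(id, "unknown"))
resolved.collect()

With many tasks running as threads in the same executor JVM, they all read the single deserialized copy behind bcLookup.value instead of each JVM holding its own.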