In Apache Spark, what is the difference between using mapPartitions and combining a broadcast variable with map?

In Spark, we use broadcast variables to give each machine a read-only copy of a variable. We usually create a broadcast variable outside the closure (such as a lookup table needed by the closure) to improve performance.
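For example, a broadcast lookup table used with map might look like this (a minimal sketch in spark-shell style; `sc` is the usual SparkContext and the table contents are made up):

// lookup table created once on the driver...
val countryNames = Map("us" -> "United States", "de" -> "Germany")
// ...and shipped to each executor once as a read-only copy
val lookup = sc.broadcast(countryNames)

sc.parallelize(Seq("us", "de", "fr"))
  .map(code => lookup.value.getOrElse(code, "unknown"))  // reads the executor-local copy
  .collect()
// res0: Array[String] = Array(United States, Germany, unknown)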

We also have a Spark transformation operator called mapPartitions, which tries to achieve the same thing (use a shared variable to improve performance). For example, in mapPartitions we can share a database connection within each partition.
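A minimal sketch of that pattern (spark-shell style; DummyConnection is a made-up stand-in for a real database client):

class DummyConnection {
  def lookup(x: Int): String = s"row-$x"
  def close(): Unit = ()
}

sc.parallelize(1 to 100, 4).mapPartitions { iter =>
  val conn = new DummyConnection                   // opened once per partition, not per element
  val rows = iter.map(x => conn.lookup(x)).toList  // materialize before closing the connection
  conn.close()
  rows.iterator
}.count()
// res0: Long = 100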

So what's the difference between these two? Can we use them interchangeably just for sharing variables?

Asked Dec 28 '15 by xuanyue

People also ask

What is the difference between map and mapPartitions in spark?

mapPartitions() – This is precisely the same as map(); the difference is that Spark mapPartitions() provides a facility to do heavy initializations (for example, a database connection) once per partition instead of on every DataFrame row.

What is the difference between broadcast variable and accumulator in spark?

Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.

What is difference between mapPartitions and Foreachpartition?

The difference is the same as that between map and foreach. Look here for a good explanation: Is there a difference between foreach and map?. mapPartitions (a transformation) and foreachPartition (an action) apply to each partition of the DataFrame rather than to each element, as sketched below.
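A quick spark-shell sketch of that contrast (`sc` assumed; names are illustrative):

val rdd = sc.parallelize(1 to 10, 2)

// transformation: lazily builds a new RDD
val doubled = rdd.mapPartitions(iter => iter.map(_ * 2))
doubled.collect()
// res0: Array[Int] = Array(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)

// action: runs immediately for its side effects and returns Unit
rdd.foreachPartition(iter => println(s"partition size: ${iter.size}"))
// prints "partition size: 5" twice, on the executors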

What is the use of accumulator and broadcast variable in Apache spark discuss them with complete code example?

An accumulator variable has an attribute called value, similar to a broadcast variable; it stores the data and is used to return the accumulator's value, but it is usable only in the driver program. In the sketch below, an accumulator variable is updated by multiple workers and returns an accumulated value.
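A minimal sketch of both shared-variable kinds together (Spark 2.x API in spark-shell style; data and names are made up):

val validCodes = sc.broadcast(Set("a", "b"))       // read-only copy on every executor
val badRecords = sc.longAccumulator("badRecords")  // workers add, only the driver reads

sc.parallelize(Seq("a", "x", "b", "y")).foreach { code =>
  if (!validCodes.value.contains(code)) badRecords.add(1)
}
badRecords.value
// 2 (readable only in the driver program)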


2 Answers

While the answer provided by KrisP highlights all the important differences, I think it is worth noting that mapPartitions is just a low-level building block behind higher-level transformations, not a method to achieve shared state.

Although mapPartitions can be used to make shared-like state explicit, it is technically not shared (its lifetime is limited to the mapPartitions closure) and there are other means to achieve it. In particular, variables which are referenced inside closures are shared within a partition. To illustrate that, let's play a little with singletons:

object DummySharedState {
  var i = 0L                  // mutable counter local to each copy of the object
  def get(x: Any): Long = {
    i += 1L
    i
  }
}

sc.parallelize(1 to 100, 1).map(DummySharedState.get).max
// res3: Long = 100
sc.parallelize(1 to 100, 2).map(DummySharedState.get).max
// res4: Long = 50
sc.parallelize(1 to 100, 50).map(DummySharedState.get).max
// res5: Long = 2

and a similar thing in PySpark:

  • singleton module dummy_shared_state.py:

    i = 0
    def get(x):
        global i
        i += 1
        return i
    
  • main script:

    from pyspark import SparkConf, SparkContext
    import dummy_shared_state
    
    master = "spark://..."
    conf = (SparkConf()
        .setMaster(master)
        .set("spark.python.worker.reuse", "false"))
    sc = SparkContext(conf=conf)
    
    sc.addPyFile("dummy_shared_state.py")
    sc.parallelize(range(100), 1).map(dummy_shared_state.get).max()
    ## 100
    sc.parallelize(range(100), 2).map(dummy_shared_state.get).max()
    ## 50 
    

Please note that the spark.python.worker.reuse option is set to false above. If you keep the default value you'll actually see something like this:

sc.parallelize(range(100), 2).map(dummy_shared_state.get).max()
## 50
sc.parallelize(range(100), 2).map(dummy_shared_state.get).max()
## 100
sc.parallelize(range(100), 2).map(dummy_shared_state.get).max()
## 150

At the end of the day you have to distinguish between three different things:

  • broadcast variables, which are designed to reduce network traffic and memory footprint by keeping a copy of the variable on each worker instead of shipping it with every task
  • variables defined outside the closure and referenced inside it, which have to be shipped with each task and are shared within that task
  • variables defined inside the closure, which are not shared at all
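A small spark-shell sketch of all three cases (`sc` assumed; bigTable is a made-up value):

val bigTable = Map("k" -> 1)       // lives on the driver

val bc = sc.broadcast(bigTable)    // case 1: one read-only copy per executor
val rdd = sc.parallelize(1 to 10, 2)

rdd.map(x => x + bc.value.size).collect()   // tasks read the local broadcast copy
rdd.map(x => x + bigTable.size).collect()   // case 2: bigTable is serialized into every task
rdd.map { x =>
  val buf = new StringBuilder      // case 3: created per element, never shared
  buf.append(x).length
}.collect()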

On top of that there are some Python-specific gotchas related to the use of persistent interpreters.

Still, there is no practical difference between map (or filter, or another transformation) and mapPartitions when it comes to variable lifetime.

Answered by zero323


broadcast is used to ship an object to every worker node. This object is then shared among all partitions on that node (and its value is the same on every node in the cluster). The goal of broadcasting is to save on network costs when you use the same data in many different tasks/partitions on the worker nodes.

mapPartitions, in contrast, is a method available on RDDs that works like map, only on partitions. Yes, you can define new objects there, such as a JDBC connection, which will then be unique to each partition. However, you can't share such an object among different partitions, much less among different nodes.

Answered by KrisP