
Apache Spark: Difference between parallelize and broadcast

In Spark (python):

If sc is a Spark context (pyspark.SparkContext), what is the difference between:

r = sc.parallelize([1,2,3,4,5])

and

r = sc.broadcast([1,2,3,4,5])

asked Sep 21 '16 by Lior

2 Answers

An RDD in Spark is just a collection split into partitions (at least one). Each partition lives on an executor, which processes it. With sc.parallelize(), your collection is split into partitions assigned to executors: for example, one executor could hold [1,2], another [3], and another [4,5]. The executors then process their partitions in parallel. With broadcast, as GwydionFR said, the passed parameter is copied whole to each executor.
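For concreteness, here is a minimal sketch of both calls (assuming a local SparkContext created just for the example; the exact partition contents depend on the number of slices and may differ on your machine):

from pyspark import SparkContext

sc = SparkContext("local[3]", "parallelize-vs-broadcast")

# parallelize: the list is split into partitions, processed in parallel
rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=3)
print(rdd.glom().collect())                # e.g. [[1], [2, 3], [4, 5]] -- one sublist per partition
print(rdd.map(lambda x: x * 2).collect())  # [2, 4, 6, 8, 10]

# broadcast: every executor gets a read-only copy of the *whole* list
b = sc.broadcast([1, 2, 3, 4, 5])
print(b.value)                             # [1, 2, 3, 4, 5] -- the full list, not a partition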

answered Oct 27 '22 by ppatierno


sc.parallelize(...) spreads the data among all executors

sc.broadcast(...) copies the data to the JVM of each executor
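To see why you would broadcast rather than parallelize, here is a hedged sketch of the typical use case: a small lookup table copied to every executor and read from inside a transformation (the table contents here are made up purely for illustration):

from pyspark import SparkContext

sc = SparkContext("local[2]", "broadcast-lookup")

# Large dataset: distributed across executors as an RDD
codes = sc.parallelize([1, 2, 3, 1, 2])

# Small lookup table: copied whole to each executor
names = sc.broadcast({1: "one", 2: "two", 3: "three"})

# Each task reads the broadcast value locally -- no shuffle required
print(codes.map(lambda c: names.value[c]).collect())
# ['one', 'two', 'three', 'one', 'two']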

answered Oct 27 '22 by GwydionFR