In Spark (Python): if sc is a Spark context (pyspark.SparkContext), what is the difference between

r = sc.parallelize([1,2,3,4,5])

and

r = sc.broadcast([1,2,3,4,5])

?
An RDD in Spark is just a collection split into partitions (at least one). Each partition lives on an executor, which processes it. With sc.parallelize(), your collection is split into partitions and assigned to executors, so, for example, you could have [1,2] on one executor, [3] on another, and [4,5] on a third. The executors then process their partitions in parallel. With broadcast, as GwydionFR said, the passed value is copied to every executor.
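A minimal sketch of both calls (run locally; the numSlices value, app name, and example output are only illustrative). glom() shows how parallelize() splits the list into partitions, and .value reads the broadcast copy:

from pyspark import SparkContext

sc = SparkContext("local[3]", "parallelize-vs-broadcast")

# parallelize: the list is split into partitions that executors process in parallel
rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=3)
print(rdd.getNumPartitions())   # 3
print(rdd.glom().collect())     # e.g. [[1], [2, 3], [4, 5]] -- one sublist per partition

# broadcast: the whole list is copied to every executor as a read-only value
b = sc.broadcast([1, 2, 3, 4, 5])
print(b.value)                  # [1, 2, 3, 4, 5] -- read via .value inside tasks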
sc.parallelize(...)
spreads the data amongst all executors

sc.broadcast(...)
copies the data to the JVM of each executor
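As a hedged illustration of why you would broadcast a value rather than parallelize it (the names below, such as country_names and orders, are made up for the example): a small lookup table can be broadcast once and then read locally by every task, instead of being re-shipped inside each task's closure.

# the lookup table is copied to each executor, not partitioned
country_names = sc.broadcast({"us": "United States", "fr": "France", "it": "Italy"})

# this collection IS partitioned and processed in parallel
orders = sc.parallelize([("us", 10), ("fr", 7), ("it", 3)])

# each task reads the broadcast value via .value on its own executor
labelled = orders.map(lambda kv: (country_names.value[kv[0]], kv[1]))
print(labelled.collect())   # [('United States', 10), ('France', 7), ('Italy', 3)]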