In Spark (Python): if sc is a Spark context (pyspark.SparkContext), what is the difference between

r = sc.parallelize([1,2,3,4,5])

and

r = sc.broadcast([1,2,3,4,5])

?
An RDD in Spark is just a collection split into partitions (at least one). Each partition lives on an executor, which processes it. With sc.parallelize(), your collection is split into partitions and assigned to executors, so, for example, you could have [1,2] on one executor, [3] on another, and [4,5] on a third. The executors then process their partitions in parallel. With broadcast, as GwydionFR said, the passed value is copied to every executor.
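A minimal sketch of both calls (run locally; the numSlices value, app name, and example output are only illustrative). glom() shows how parallelize() splits the list into partitions, and .value reads the broadcast copy:

from pyspark import SparkContext

sc = SparkContext("local[3]", "parallelize-vs-broadcast")

# parallelize: the list is split into partitions that executors process in parallel
rdd = sc.parallelize([1, 2, 3, 4, 5], numSlices=3)
print(rdd.getNumPartitions())   # 3
print(rdd.glom().collect())     # e.g. [[1], [2, 3], [4, 5]] -- one sublist per partition

# broadcast: the whole list is copied to every executor as a read-only value
b = sc.broadcast([1, 2, 3, 4, 5])
print(b.value)                  # [1, 2, 3, 4, 5] -- read via .value inside tasks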
sc.parallelize(...)
spreads the data amongst all executors

sc.broadcast(...)
copies the data to the JVM of each executor
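As a hedged illustration of why you would broadcast a value rather than parallelize it (the names below, such as country_names and orders, are made up for the example): a small lookup table can be broadcast once and then read locally by every task, instead of being re-shipped inside each task's closure.

# the lookup table is copied to each executor, not partitioned
country_names = sc.broadcast({"us": "United States", "fr": "France", "it": "Italy"})

# this collection IS partitioned and processed in parallel
orders = sc.parallelize([("us", 10), ("fr", 7), ("it", 3)])

# each task reads the broadcast value via .value on its own executor
labelled = orders.map(lambda kv: (country_names.value[kv[0]], kv[1]))
print(labelled.collect())   # [('United States', 10), ('France', 7), ('Italy', 3)]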