I used a standalone Spark cluster to process several files. When I ran the driver, the data was processed on each worker using its cores.

Now I've read about partitions, but I didn't understand whether a partition is different from a worker core or not. Is there a difference between setting the number of cores and the number of partitions?
Simplistic view: Partition vs Number of Cores
When you invoke an action on an RDD, the work is split into one task per partition. A partition (or task) is a unit of work: if you have a 200 GB Hadoop file loaded as an RDD and chunked into 128 MB blocks (Spark's default), you end up with roughly 1600 partitions in that RDD. The number of cores determines how many of those partitions can be processed at any one time; the effective parallelism at any moment is the smaller of the total core count and the number of partitions.
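A minimal sketch of the difference, assuming a standalone cluster and a hypothetical HDFS path; `spark.cores.max` caps the cores the application may use, while the partition count comes from how the input file is split:

```scala
import org.apache.spark.sql.SparkSession

object PartitionsVsCores {
  def main(args: Array[String]): Unit = {
    // Cores: limit the whole application to 8 cores across the cluster.
    // This is the ceiling on how many tasks run at the same time.
    val spark = SparkSession.builder()
      .appName("partitions-vs-cores")
      .config("spark.cores.max", "8")
      .getOrCreate()

    // Partitions: a hypothetical large file on HDFS; with ~128 MB input
    // splits, a 200 GB file yields on the order of 1600 partitions.
    val rdd = spark.sparkContext.textFile("hdfs:///data/big-file.txt")
    println(s"partitions: ${rdd.getNumPartitions}")

    // The action below creates one task per partition; only 8 of them
    // (the core count) execute concurrently, the rest wait in the queue.
    println(s"lines: ${rdd.count()}")

    spark.stop()
  }
}
```

So setting the number of cores changes how many tasks run concurrently, while changing the number of partitions (for example with `repartition`) changes how many tasks there are in total.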