 

What is the difference between SPARK Partitions and Worker Cores?

I used a Standalone Spark Cluster to process several files. When I executed the driver, the data was processed on each worker using its cores.

Now I've read about partitions, but I couldn't tell whether they are different from worker cores or not.

Is there a difference between setting the number of cores and the number of partitions?

asked Dec 08 '22 by hasan.alkhatib

2 Answers

Simplistic view: Partition vs Number of Cores

When you invoke an action on an RDD,

  • A "Job" is created for it. So, Job is a work submitted to spark.
  • Jobs are divided in to "STAGE" based n the shuffle boundary!!!
  • Each stage is further divided to tasks based on the number of partitions on the RDD. So Task is smallest unit of work for spark.
  • Now, how many of these tasks can be executed simultaneously depends on the "Number of Cores" available!!!
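
To make the relationship concrete, here is a minimal sketch of setting cores versus partitions. The master URL, file path, and the numbers 4 and 16 are illustrative assumptions, not values taken from the question:

    import org.apache.spark.sql.SparkSession

    object PartitionsVsCores {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("partitions-vs-cores")
          .master("spark://master:7077")        // hypothetical standalone master URL
          .config("spark.executor.cores", "4")  // cores: how many tasks each executor runs at once
          .getOrCreate()

        // Partitions: how the data is split; here we explicitly ask for 16 partitions.
        val rdd = spark.sparkContext.textFile("hdfs:///data/input.txt", 16)

        // count() is an action, so it triggers a job; the final stage has 16 tasks
        // (one per partition), but only 4 of them run concurrently per executor.
        println(s"lines: ${rdd.count()}, partitions: ${rdd.getNumPartitions}")

        spark.stop()
      }
    }

So the partition count fixes how many tasks exist, while the core count fixes how many of those tasks execute at the same time.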
answered Feb 16 '23 by rakesh


A partition (or task) refers to a unit of work. If you have a 200 GB Hadoop file loaded as an RDD and chunked into 128 MB blocks (the Spark default), you end up with roughly 1600 partitions in that RDD. The number of cores determines how many of those partitions can be processed at any one time; with enough cores, up to that many tasks (capped at the number of partitions) can execute in parallel.
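A quick back-of-the-envelope sketch of that arithmetic (the 64-core cluster is a made-up figure purely for illustration):

    object PartitionMath {
      def main(args: Array[String]): Unit = {
        val fileSizeMB     = 200L * 1024   // 200 GB expressed in MB
        val blockSizeMB    = 128L          // default split size
        val partitions     = fileSizeMB / blockSizeMB
        val coresInCluster = 64            // hypothetical total cores across executors

        println(s"partitions (≈ tasks): $partitions")                              // 1600
        println(s"tasks running at once: ${math.min(partitions, coresInCluster)}") // 64
      }
    }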

answered Feb 16 '23 by Tim