I used a standalone Spark cluster to process several files. When I ran the driver, the data was processed on each worker using its cores.

Now I've read about partitions, but I didn't understand whether a partition is different from a worker core or not. Is there a difference between setting the number of cores and the number of partitions?
Simplistic view: Partition vs Number of Cores
When you invoke an action on an RDD, the work is split into one task per partition. A partition (or task) is a unit of work: if you have a 200 GB Hadoop file loaded as an RDD and chunked into 128 MB blocks (Spark's default), you end up with roughly 1600 partitions in that RDD. The number of cores determines how many of those partitions can be processed at any one time; the effective parallelism at any moment is the smaller of the total core count and the number of partitions.
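A minimal sketch of the difference, assuming a standalone cluster and a hypothetical HDFS path; `spark.cores.max` caps the cores the application may use, while the partition count comes from how the input file is split:

```scala
import org.apache.spark.sql.SparkSession

object PartitionsVsCores {
  def main(args: Array[String]): Unit = {
    // Cores: limit the whole application to 8 cores across the cluster.
    // This is the ceiling on how many tasks run at the same time.
    val spark = SparkSession.builder()
      .appName("partitions-vs-cores")
      .config("spark.cores.max", "8")
      .getOrCreate()

    // Partitions: a hypothetical large file on HDFS; with ~128 MB input
    // splits, a 200 GB file yields on the order of 1600 partitions.
    val rdd = spark.sparkContext.textFile("hdfs:///data/big-file.txt")
    println(s"partitions: ${rdd.getNumPartitions}")

    // The action below creates one task per partition; only 8 of them
    // (the core count) execute concurrently, the rest wait in the queue.
    println(s"lines: ${rdd.count()}")

    spark.stop()
  }
}
```

So setting the number of cores changes how many tasks run concurrently, while changing the number of partitions (for example with `repartition`) changes how many tasks there are in total.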