 

What is the relationship between tasks and partitions?

Tags:

apache-spark

Can I say?

  1. Is the number of Spark tasks equal to the number of Spark partitions?

  2. The executor runs once (batch inside of executor) is equal to one task?

  3. Does every task produce only one partition?

  4. (duplicate of 1.)

asked Dec 12 '17 by cdhit

People also ask

How many tasks does Spark run on each partition?

Spark assigns one task per partition and each worker can process one task at a time.
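The "one task per partition, one task per worker at a time" rule can be sketched with a toy Python model (this is an illustration using a plain thread pool, not Spark's actual scheduler):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model (not Spark itself): one task per partition, and a worker
# pool that can only process `cores` tasks concurrently.
partitions = [list(range(i * 10, i * 10 + 10)) for i in range(6)]  # 6 partitions

def task(partition):
    # Each task processes exactly one partition.
    return sum(partition)

cores = 2  # the pool can run only 2 tasks at a time
with ThreadPoolExecutor(max_workers=cores) as pool:
    results = list(pool.map(task, partitions))

print(len(results))  # one result per partition -> 6
```

With 6 partitions and 2 slots, all 6 tasks eventually run, but at most 2 at any moment.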

How are partitions determined in Spark?

Spark partitions number: the best way to decide the number of partitions in an RDD is to make it equal to the number of cores in the cluster. That way all partitions are processed in parallel and resources are used optimally.

What is a task in Spark?

Task is the smallest execution unit in Spark. A task in spark executes a series of instructions. For eg. reading data, filtering and applying map() on data can be combined into a task. Tasks are executed inside an executor.
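The "series of instructions combined into a task" idea can be sketched in plain Python (a hypothetical model, not Spark's runtime): several narrow operations are fused into one function that runs over the records of a single partition without materializing intermediates.

```python
# Toy sketch (not Spark's actual runtime): a "task" pipelines several
# narrow operations -- filter, then map -- over one partition's records.
def make_task(filter_fn, map_fn):
    def task(partition):
        # One pass over the partition; no intermediate collections per step.
        return [map_fn(x) for x in partition if filter_fn(x)]
    return task

task = make_task(lambda x: x % 2 == 0, lambda x: x * x)
print(task([1, 2, 3, 4]))  # [4, 16]
```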

What is difference between stage and task in Spark?

In Apache Spark, a stage is a physical unit of execution: a step in the physical execution plan. It is a set of parallel tasks, one task per partition. In other words, each job is divided into smaller sets of tasks, which are what you call stages.


2 Answers

The degree of parallelism, or the number of tasks that can be run concurrently, is set by:

  • The number of Executor Instances (configuration)
  • The Number of Cores per Executor (configuration)
  • The Number of Partitions being used (coded)

Actual parallelism is the smaller of:

  • executors * cores, which gives the number of slots available to run tasks
  • partitions, as each partition translates to a task whenever a slot opens up
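The rule above reduces to a one-line formula; this small sketch (names are illustrative) makes the two regimes concrete:

```python
# Sketch of the rule above: actual parallelism is the smaller of the
# slot count (executors * cores) and the partition count.
def actual_parallelism(executors, cores_per_executor, partitions):
    slots = executors * cores_per_executor
    return min(slots, partitions)

# Slot-bound: 200 partitions but only 16 slots -> 16 tasks run at once.
print(actual_parallelism(executors=4, cores_per_executor=4, partitions=200))  # 16
# Partition-bound: 8 partitions leave 8 of the 16 slots idle.
print(actual_parallelism(executors=4, cores_per_executor=4, partitions=8))    # 8
```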

Tasks that run on the same executor will share the same JVM. This is used by the Broadcast feature as you only need one copy of the Broadcast data per Executor for all tasks to be able to access it through shared memory.

You can have multiple executors running, on the same machine, or on different machines. Executors are the true means of scalability.

Note that each Task takes up one Thread ¹, and is assumed to be assigned to one core ².

So:

  1. Is the number of Spark tasks equal to the number of Spark partitions?

No (see the parallelism discussion above).

  2. The executor runs once (batch inside of executor) is equal to one task?

An Executor is started as an environment for the tasks to run. Multiple tasks will run concurrently within that Executor (multiple threads).

  3. Does every task produce only one partition?

For a task, it is one partition in, one partition out. However, a repartitioning or shuffle/sort can happen between tasks.
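The "one partition in, one partition out" behavior, and how a shuffle changes the partition count between stages, can be modeled in plain Python (function names are illustrative, not Spark's API):

```python
# Toy model: a stage applies one task per input partition, producing
# exactly one output partition per task.
def run_stage(partitions, task):
    return [task(p) for p in partitions]

# A shuffle between stages redistributes all records into n new partitions.
def repartition(partitions, n):
    records = [r for p in partitions for r in p]
    return [records[i::n] for i in range(n)]

stage1 = run_stage([[1, 2], [3, 4], [5, 6]], lambda p: [x * 10 for x in p])
print(len(stage1))  # 3: partition count preserved within a stage

stage2_input = repartition(stage1, 2)
print(len(stage2_input))  # 2: partition count changed by the shuffle
```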

  4. Is the number of Spark tasks equal to the number of Spark partitions?

Same as (1.).

(¹) This assumes that within your tasks you are not doing your own multithreading (avoid that; otherwise the core estimate will be off).

(²) Note that due to hyper-threading, you might have more than one virtual core per physical core, and thus you can have several threads per core. You might even be able to handle multiple threads (2 to 3) on a single core without hyper-threading.

answered Sep 19 '22 by YoYo


Partitions are a feature of RDD and are only available at design time (before an action is called).

Tasks are part of TaskSet per Stage per ActiveJob in a Spark application.

Is the number of Spark tasks equal to the number of Spark partitions?

Yes.

The executor runs once (batch inside of executor) is equal to one task?

That recursively uses "executor" and does not make much sense to me.

Does every task produce only one partition?

Almost.

Every task produces the output of executing its code (what it was created for) on the data in one partition.

Is the number of Spark tasks equal to the number of Spark partitions?

Almost.

The number of Spark tasks in a single stage equals the number of RDD partitions.

answered Sep 21 '22 by Jacek Laskowski