What is the concept of application, job, stage and task in Spark?

Tags:

apache-spark

Is my understanding right?

  1. Application: one spark-submit.

  2. Job: once a lazy evaluation happens, there is a job.

  3. Stage: it is related to the shuffle and to the type of transformation. I find it hard to understand where the boundary of a stage lies.

  4. Task: it is a unit operation, with one transformation per task and one task per transformation.

I'd appreciate any help improving this understanding.
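
For concreteness, here is the kind of minimal program I am trying to map these terms onto (the input path is just a placeholder):

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Application: everything launched by this one spark-submit / main().
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("input.txt")      // transformation: lazy, no job yet
      .flatMap(_.split(" "))                   // transformation
      .map(word => (word, 1))                  // transformation
      .reduceByKey(_ + _)                      // transformation that requires a shuffle

    counts.collect().foreach(println)          // action: this is what actually triggers a job

    spark.stop()
  }
}
```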

asked Feb 16 '17 by cdhit

People also ask

What is stage task and job in Spark?

Summary. A Spark application can have many jobs. A job can have many stages. A stage can have many tasks. A task executes a series of instructions.

What is a task in Spark?

In Spark, a task (a.k.a. a command) is the smallest individual unit of execution and corresponds to an RDD partition. Tasks are launched on executors. In other (more technical) words, a task is a computation on one data partition within one stage of a Spark job.

What is an application in Spark?

A Spark application is a self-contained computation that runs user-supplied code to compute a result. A Spark application can have processes running on its behalf even when it is not running a job.

What is executor and task in Spark?

Executors in Spark are worker processes that run the individual tasks of a given Spark job. They are launched at the beginning of a Spark application, and as tasks finish, their results are sent back to the driver.
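
As a rough sketch of how this fits together (the numbers here are made up, not recommendations): the number of tasks a stage can run at the same time is bounded by executors × cores per executor, which can be set when the session is created:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative settings: 4 executors with 2 cores each means at most
// 4 * 2 = 8 tasks of a stage run concurrently; the rest queue until a
// core frees up. spark.executor.instances is only honoured on cluster
// managers such as YARN or Kubernetes.
val spark = SparkSession.builder()
  .appName("executor-sketch")
  .config("spark.executor.instances", "4")
  .config("spark.executor.cores", "2")
  .getOrCreate()
```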


2 Answers

The main function (everything launched by a single spark-submit) is the application.

When you invoke an action on an RDD, a "job" is created. Jobs are work submitted to Spark.

Jobs are divided into "stages" at shuffle boundaries.

Each stage is further divided into tasks based on the number of partitions in the RDD. So tasks are the smallest units of work for Spark.
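
To make those boundaries concrete, here is a small sketch (partition counts are arbitrary, and an existing SparkContext `sc` is assumed, as in spark-shell). `toDebugString` prints the RDD lineage with extra indentation at each shuffle, which is exactly where a job is cut into stages:

```scala
val result = sc.parallelize(1 to 1000, 8)   // 8 partitions -> 8 tasks in the first stage
  .map(x => (x % 10, x))                    // narrow transformation: stays in the same stage
  .reduceByKey(_ + _, 4)                    // shuffle -> a second stage with 4 tasks

println(result.toDebugString)               // indentation marks the shuffle (stage) boundary

result.count()                              // action -> submits one job made of the two stages
```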

answered Sep 28 '22 by rakesh


Application - A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster.

Job - A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g., save(), collect()). During interactive sessions with Spark shells, the driver converts your Spark application into one or more Spark jobs. It then transforms each job into a DAG. This, in essence, is Spark’s execution plan, where each node within a DAG could be a single or multiple Spark stages.

Stage - Each job gets divided into smaller sets of tasks called stages that depend on each other. As part of the DAG nodes, stages are created based on what operations can be performed serially or in parallel. Not all Spark operations can happen in a single stage, so they may be divided into multiple stages. Often stages are delineated on the operator’s computation boundaries, where they dictate data transfer among Spark executors.

Task - A single unit of work or execution that will be sent to a Spark executor. Each stage is comprised of Spark tasks (a unit of execution), which are then federated across each Spark executor; each task maps to a single core and works on a single partition of data. As such, an executor with 16 cores can have 16 or more tasks working on 16 or more partitions in parallel, making the execution of Spark's tasks exceedingly parallel!

(Figure: a Spark stage creating one or more tasks to be distributed to executors.)

Disclaimer: Content copied from: Learning Spark
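
As a small, purely illustrative sketch of that partition-to-task mapping (the output path and the number 16 are arbitrary): after repartitioning to 16 partitions, the stage that writes the data runs as 16 tasks, and an executor pool with 16 free cores can execute them all at once.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-sketch").getOrCreate()

val df = spark.range(0, 1000000).toDF("id")

// repartition(16) introduces a shuffle; the write stage that follows it
// consists of 16 tasks, one per partition.
df.repartition(16)
  .write
  .mode("overwrite")
  .parquet("/tmp/partition-sketch")
```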

answered Sep 28 '22 by venus