Is my understanding right?
Application: one spark-submit.
Job: once the lazy evaluation is actually triggered (by an action), there is a job.
Stage: it is related to shuffles and the type of transformation. The boundary of a stage is hard for me to understand.
Task: it is the unit of operation. One transformation per task, and one task per transformation.
Any help to improve this understanding is appreciated.
Summary. A Spark application can have many jobs. A job can have many stages. A stage can have many tasks. A task executes a series of instructions.
In Spark, a task (aka command) is the smallest individual unit of execution; it corresponds to a single RDD partition. Tasks are launched on executors. In other (more technical) words, a task is a computation on a data partition in a stage of an RDD in a Spark job.
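To make the task-per-partition mapping concrete, here is a minimal PySpark sketch (assuming a local pyspark installation; the application name and partition count are arbitrary):

```python
from pyspark.sql import SparkSession

# Minimal sketch: each stage launches one task per RDD partition.
spark = SparkSession.builder.master("local[4]").appName("task-per-partition").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), numSlices=4)  # 4 partitions
print(rdd.getNumPartitions())                  # 4

# The action triggers a job; its single stage runs 4 tasks,
# one per partition (visible in the Spark UI).
rdd.count()

spark.stop()
```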
Spark Application. The Spark application is a self-contained computation that runs user-supplied code to compute a result. A Spark application can have processes running on its behalf even when it's not running a job.
Executors in Spark are worker processes (running on the cluster's worker nodes) that execute individual tasks for a given Spark application. They are launched at the beginning of a Spark application, and once a task completes, its result is sent back to the driver.
The main function is the application.
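As a rough sketch of that idea (pyspark assumed; the application name and numbers are illustrative), the driver is the process running your main function, and everything it submits belongs to one application:

```python
from pyspark.sql import SparkSession

def main():
    # Creating the SparkSession starts the driver side of one application;
    # executors are acquired for it and stay alive until spark.stop().
    spark = SparkSession.builder.appName("my-application").getOrCreate()

    df = spark.range(1_000_000)
    df.count()                            # action -> job 1 of this application
    df.selectExpr("sum(id)").collect()    # action -> job 2 of the same application

    spark.stop()                          # ends the application

if __name__ == "__main__":
    main()   # e.g. launched with a single spark-submit
```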
When you invoke an action on an RDD, a "job" is created. Jobs are work submitted to Spark.
Jobs are divided into "stages" at shuffle boundaries; the sketch below illustrates this.
Each stage is further divided into tasks based on the number of partitions in the RDD. So tasks are the smallest units of work for Spark.
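A hedged sketch of that division (local mode, pyspark assumed; the data and partition counts are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("job-stage-task").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=3)
pairs = words.map(lambda w: (w, 1))              # narrow: stays in the same stage
counts = pairs.reduceByKey(lambda a, b: a + b)   # wide: introduces a shuffle boundary

# The action creates one job with two stages:
#   - map-side stage: 3 tasks, one per input partition
#   - reduce-side stage: one task per post-shuffle partition
print(counts.collect())

spark.stop()
```

You can confirm the job's two stages and their task counts on the job's page in the Spark UI.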
Application - A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster.
Job - A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g., save(), collect()). During interactive sessions with Spark shells, the driver converts your Spark application into one or more Spark jobs. It then transforms each job into a DAG. This, in essence, is Spark’s execution plan, where each node within a DAG could be a single or multiple Spark stages.
Stage - Each job gets divided into smaller sets of tasks called stages that depend on each other. As part of the DAG nodes, stages are created based on what operations can be performed serially or in parallel. Not all Spark operations can happen in a single stage, so they may be divided into multiple stages. Often stages are delineated on the operator’s computation boundaries, where they dictate data transfer among Spark executors.
Task - A single unit of work or execution that will be sent to a Spark executor. Each stage is comprised of Spark tasks (a unit of execution), which are then federated across each Spark executor; each task maps to a single core and works on a single partition of data. As such, an executor with 16 cores can have 16 or more tasks working on 16 or more partitions in parallel, making the execution of Spark’s tasks exceedingly parallel!
Disclaimer: Content copied from: Learning Spark
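Translating those definitions into a small DataFrame sketch (pyspark assumed; the shuffle-partition setting is illustrative, and adaptive execution may coalesce post-shuffle partitions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("learning-spark-terms").getOrCreate()

# Fewer shuffle partitions -> fewer tasks in the stage after the shuffle.
spark.conf.set("spark.sql.shuffle.partitions", "8")

df = spark.range(0, 1_000_000)
agg = df.groupBy((df.id % 10).alias("bucket")).count()  # groupBy forces a shuffle

# collect() is the action: it spawns one job, split into stages at the
# shuffle; the post-shuffle stage runs up to 8 tasks, one per partition,
# each occupying one executor core while it runs.
agg.collect()

spark.stop()
```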