Is my understanding right?
Application: one spark-submit.
Job: once the lazy evaluation is actually triggered (by an action), there is a job.
Stage: it is related to shuffles and the type of transformation. The boundary of a stage is hard for me to understand.
Task: it is the unit of operation. One transformation per task, and one task per transformation.
Any help to improve this understanding is appreciated.
Summary. A Spark application can have many jobs. A job can have many stages. A stage can have many tasks. A task executes a series of instructions.
In Spark, a task (aka command) is the smallest individual unit of execution; it corresponds to a single RDD partition. Tasks are launched on executors. In other (more technical) words, a task is a computation on a data partition in a stage of an RDD in a Spark job.
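To make the task-per-partition mapping concrete, here is a minimal PySpark sketch (assuming a local pyspark installation; the application name and partition count are arbitrary):

```python
from pyspark.sql import SparkSession

# Minimal sketch: each stage launches one task per RDD partition.
spark = SparkSession.builder.master("local[4]").appName("task-per-partition").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), numSlices=4)  # 4 partitions
print(rdd.getNumPartitions())                  # 4

# The action triggers a job; its single stage runs 4 tasks,
# one per partition (visible in the Spark UI).
rdd.count()

spark.stop()
```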
Spark Application. The Spark application is a self-contained computation that runs user-supplied code to compute a result. A Spark application can have processes running on its behalf even when it's not running a job.
Executors in Spark are worker processes (running on the cluster's worker nodes) that execute individual tasks for a given Spark application. They are launched at the beginning of a Spark application, and once a task completes, its result is sent back to the driver.
The main function is the application.
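As a rough sketch of that idea (pyspark assumed; the application name and numbers are illustrative), the driver is the process running your main function, and everything it submits belongs to one application:

```python
from pyspark.sql import SparkSession

def main():
    # Creating the SparkSession starts the driver side of one application;
    # executors are acquired for it and stay alive until spark.stop().
    spark = SparkSession.builder.appName("my-application").getOrCreate()

    df = spark.range(1_000_000)
    df.count()                            # action -> job 1 of this application
    df.selectExpr("sum(id)").collect()    # action -> job 2 of the same application

    spark.stop()                          # ends the application

if __name__ == "__main__":
    main()   # e.g. launched with a single spark-submit
```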
When you invoke an action on an RDD, a "job" is created. Jobs are work submitted to Spark.
Jobs are divided into "stages" at shuffle boundaries; the sketch below illustrates this.
Each stage is further divided into tasks based on the number of partitions in the RDD. So tasks are the smallest units of work for Spark.
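A hedged sketch of that division (local mode, pyspark assumed; the data and partition counts are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("job-stage-task").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"], numSlices=3)
pairs = words.map(lambda w: (w, 1))              # narrow: stays in the same stage
counts = pairs.reduceByKey(lambda a, b: a + b)   # wide: introduces a shuffle boundary

# The action creates one job with two stages:
#   - map-side stage: 3 tasks, one per input partition
#   - reduce-side stage: one task per post-shuffle partition
print(counts.collect())

spark.stop()
```

You can confirm the job's two stages and their task counts on the job's page in the Spark UI.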
Application - A user program built on Spark using its APIs. It consists of a driver program and executors on the cluster.
Job - A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g., save(), collect()). During interactive sessions with Spark shells, the driver converts your Spark application into one or more Spark jobs. It then transforms each job into a DAG. This, in essence, is Spark’s execution plan, where each node within a DAG could be a single or multiple Spark stages.
Stage - Each job gets divided into smaller sets of tasks called stages that depend on each other. As part of the DAG nodes, stages are created based on what operations can be performed serially or in parallel. Not all Spark operations can happen in a single stage, so they may be divided into multiple stages. Often stages are delineated on the operator’s computation boundaries, where they dictate data transfer among Spark executors.
Task - A single unit of work or execution that will be sent to a Spark executor. Each stage is comprised of Spark tasks (a unit of execution), which are then federated across each Spark executor; each task maps to a single core and works on a single partition of data. As such, an executor with 16 cores can have 16 or more tasks working on 16 or more partitions in parallel, making the execution of Spark’s tasks exceedingly parallel!
Disclaimer: Content copied from: Learning Spark
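Translating those definitions into a small DataFrame sketch (pyspark assumed; the shuffle-partition setting is illustrative, and adaptive execution may coalesce post-shuffle partitions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("learning-spark-terms").getOrCreate()

# Fewer shuffle partitions -> fewer tasks in the stage after the shuffle.
spark.conf.set("spark.sql.shuffle.partitions", "8")

df = spark.range(0, 1_000_000)
agg = df.groupBy((df.id % 10).alias("bucket")).count()  # groupBy forces a shuffle

# collect() is the action: it spawns one job, split into stages at the
# shuffle; the post-shuffle stage runs up to 8 tasks, one per partition,
# each occupying one executor core while it runs.
agg.collect()

spark.stop()
```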