I have already finished the Spark installation and executed a few test cases with master and worker nodes set up. That said, I am quite confused about what exactly a "job" means in the Spark context (not SparkContext). I have a few questions.
I have read the Spark documentation, but this is still not clear to me.
That said, my plan is to write Spark jobs programmatically and submit them via spark-submit.
Kindly help with an example if possible; it would be very helpful.
Note: kindly do not just post Spark links, because I have already tried them. Even though the questions sound naive, I still need more clarity in understanding this.
Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of circumstances. On top of the Spark core data processing engine, there are libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in an application.
In Spark, when spark-submit is called, the user code is divided into smaller units called jobs, stages, and tasks. Job: a Job is a sequence of Stages, triggered by an Action such as count(), foreachRDD(), collect(), read(), or write().
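A minimal sketch of what that looks like when you write the driver program yourself (the app name and HDFS paths below are just placeholders): the transformations are lazy and do nothing on their own, while each action spawns a job that you can see in the driver log and the Spark UI.

    import org.apache.spark.sql.SparkSession

    object WordCountJob {
      def main(args: Array[String]): Unit = {
        // Entry point of the driver program launched with spark-submit
        val spark = SparkSession.builder().appName("word-count-example").getOrCreate()
        val sc = spark.sparkContext

        val words = sc.textFile("hdfs:///data/input.txt")   // transformation: lazy, no job yet
          .flatMap(_.split("\\s+"))                          // transformation: still no job
          .filter(_.nonEmpty)                                // transformation: still no job

        val total = words.count()                            // action: spawns job 0
        words.saveAsTextFile("hdfs:///data/output")          // action: spawns job 1

        println(s"total words: $total")
        spark.stop()
      }
    }

Packaged into a jar and launched with spark-submit, this single program therefore produces two jobs, one per action.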
The career outlook for Spark developers is good. Demand for entry-level positions such as software developer has been increasing rapidly across global organizations.
Well, terminology can always be difficult since it depends on context. In many cases, you may be used to "submitting a job to a cluster", which for Spark would mean submitting a driver program.
That said, Spark has its own definition of "job", taken directly from the glossary:
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
So in this context, let's say you need to load a file, transform it, and then run a couple of actions on the result. Each of those actions spawns its own job, as the sketch below shows.
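A minimal sketch, assuming you are in spark-shell (so sc is already a SparkContext) and using made-up HDFS paths:

    val lines  = sc.textFile("hdfs:///logs/events.txt")       // transformation: lazy
    val counts = lines
      .map(line => (line.split(",")(0), 1))                    // narrow transformation, stays in one stage
      .reduceByKey(_ + _)                                      // shuffle: introduces a stage boundary

    counts.collect()                                           // action: spawns job 0
    counts.saveAsTextFile("hdfs:///logs/out")                  // action: spawns job 1

collect() and saveAsTextFile() are the actions here, so the driver log reports two jobs; each job is split into stages at the reduceByKey shuffle, and each stage runs one task per partition.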
Hope it makes things clearer ;-)