 

How to run multiple Spark jobs in parallel?

Tags:

apache-spark

Each Spark job runs one Oracle query, so I need to run multiple jobs in parallel so that all the queries fire at the same time.

How do I run multiple jobs in parallel?

asked Mar 30 '18 by Nagendra Palla



1 Answer

Quoting the official documentation on Job Scheduling:

Second, within each Spark application, multiple "jobs" (Spark actions) may be running concurrently if they were submitted by different threads.

In other words, a single SparkContext instance can be used by multiple threads, which gives you the ability to submit multiple Spark jobs that may or may not run in parallel.

Whether the Spark jobs actually run in parallel depends on the number of CPUs (Spark does not take memory usage into account for scheduling). If there are enough CPUs to handle the tasks from multiple Spark jobs, they will run concurrently.
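For illustration, here is a minimal sketch (assuming a Scala Spark application; the JDBC URL, credentials and table names are hypothetical placeholders) that submits one Oracle query per thread against the same SparkSession using Scala Futures:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

object ParallelQueries extends App {
  val spark = SparkSession.builder()
    .appName("parallel-oracle-queries")
    .getOrCreate()

  // Hypothetical connection details and table names -- replace with your own.
  val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
  val tables  = Seq("ORDERS", "CUSTOMERS", "INVOICES")

  // Each Future submits its own Spark job (the count action) from a separate thread,
  // so the jobs are scheduled concurrently within the single SparkContext.
  val jobs: Seq[Future[(String, Long)]] = tables.map { table =>
    Future {
      val df = spark.read
        .format("jdbc")
        .option("url", jdbcUrl)
        .option("dbtable", table)
        .option("user", "scott")
        .option("password", "tiger")
        .load()
      (table, df.count())   // count is the Spark action that triggers the job
    }
  }

  Await.result(Future.sequence(jobs), Duration.Inf).foreach {
    case (table, n) => println(s"$table: $n rows")
  }

  spark.stop()
}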

If, however, the number of CPUs is not enough, you may consider using FAIR scheduling mode (FIFO is the default):

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
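As a sketch (the pool name below is hypothetical), FAIR mode is enabled through the spark.scheduler.mode configuration, and each submitting thread can optionally be assigned to a named pool via a local property:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fair-scheduling-demo")
  .config("spark.scheduler.mode", "FAIR")   // default is FIFO
  .getOrCreate()

val sc = spark.sparkContext

// Optionally assign the jobs submitted from this thread to a named pool
// ("queries" is a hypothetical pool name; pools can be defined in fairscheduler.xml).
sc.setLocalProperty("spark.scheduler.pool", "queries")

// ... submit actions from this thread ...

// Reset to the default pool for subsequent jobs from this thread.
sc.setLocalProperty("spark.scheduler.pool", null)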


Just to clear things up a bit.

  1. spark-submit submits a Spark application for execution (not individual Spark jobs). A single Spark application has at least one Spark job.

  2. RDD actions may or may not be blocking. SparkContext comes with two methods to submit (or run) a Spark job, namely SparkContext.runJob and SparkContext.submitJob, so what really matters is not whether an action is blocking, but which SparkContext method is used to get non-blocking behaviour.

Please note that the built-in RDD action methods are already written, and their implementations use whichever method the Spark developers chose (mostly SparkContext.runJob, as in count):

// RDD.count
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

You'd have to write your own RDD actions (on a custom RDD) to get the required non-blocking behaviour in your Spark application.
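As a rough sketch of what a non-blocking submission could look like with SparkContext.submitJob (the per-partition sum is just an illustrative computation):

import scala.concurrent.Await
import scala.concurrent.duration.Duration

val rdd = sc.parallelize(1 to 1000, numSlices = 4)

// submitJob returns a FutureAction immediately instead of blocking like runJob.
val futureAction = sc.submitJob[Int, Int, Unit](
  rdd,
  (it: Iterator[Int]) => it.sum,                              // computation per partition
  0 until rdd.getNumPartitions,                               // which partitions to run on
  (index: Int, partSum: Int) => println(s"partition $index -> $partSum"),
  ()                                                          // result produced once all partitions finish
)

// The driver thread is free to do other work here; block only when the result is needed.
Await.result(futureAction, Duration.Inf)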

answered Oct 14 '22 by Jacek Laskowski