 

How to run multiple Spark jobs in parallel?

Tags:

apache-spark

Each Spark job runs one Oracle query, so I need to run multiple jobs in parallel so that all the queries fire at the same time.

How do I run multiple jobs in parallel?

asked Mar 30 '18 by Nagendra Palla



1 Answer

Quoting the official documentation on Job Scheduling:

Second, within each Spark application, multiple "jobs" (Spark actions) may be running concurrently if they were submitted by different threads.

In other words, a single SparkContext instance can be used by multiple threads, which gives you the ability to submit multiple Spark jobs that may or may not run in parallel.

Whether the Spark jobs actually run in parallel depends on the number of CPUs (Spark does not take memory usage into account for scheduling). If there are enough CPUs to handle the tasks from multiple Spark jobs, they will run concurrently.
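For illustration, here is a minimal sketch (assuming a Scala Spark application; the JDBC URL, credentials and table names are hypothetical placeholders) that submits one Oracle query per thread against the same SparkSession using Scala Futures:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

object ParallelQueries extends App {
  val spark = SparkSession.builder()
    .appName("parallel-oracle-queries")
    .getOrCreate()

  // Hypothetical connection details and table names -- replace with your own.
  val jdbcUrl = "jdbc:oracle:thin:@//dbhost:1521/ORCL"
  val tables  = Seq("ORDERS", "CUSTOMERS", "INVOICES")

  // Each Future submits its own Spark job (the count action) from a separate thread,
  // so the jobs are scheduled concurrently within the single SparkContext.
  val jobs: Seq[Future[(String, Long)]] = tables.map { table =>
    Future {
      val df = spark.read
        .format("jdbc")
        .option("url", jdbcUrl)
        .option("dbtable", table)
        .option("user", "scott")
        .option("password", "tiger")
        .load()
      (table, df.count())   // count is the Spark action that triggers the job
    }
  }

  Await.result(Future.sequence(jobs), Duration.Inf).foreach {
    case (table, n) => println(s"$table: $n rows")
  }

  spark.stop()
}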

If, however, the number of CPUs is not enough, you may consider using FAIR scheduling mode (FIFO is the default):

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
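As a sketch (the pool name below is hypothetical), FAIR mode is enabled through the spark.scheduler.mode configuration, and each submitting thread can optionally be assigned to a named pool via a local property:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fair-scheduling-demo")
  .config("spark.scheduler.mode", "FAIR")   // default is FIFO
  .getOrCreate()

val sc = spark.sparkContext

// Optionally assign the jobs submitted from this thread to a named pool
// ("queries" is a hypothetical pool name; pools can be defined in fairscheduler.xml).
sc.setLocalProperty("spark.scheduler.pool", "queries")

// ... submit actions from this thread ...

// Reset to the default pool for subsequent jobs from this thread.
sc.setLocalProperty("spark.scheduler.pool", null)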


Just to clear things up a bit.

  1. spark-submit submits a Spark application for execution (not individual Spark jobs). A single Spark application has at least one Spark job.

  2. RDD actions may or may not be blocking. SparkContext comes with two methods to submit (or run) a Spark job, namely SparkContext.runJob and SparkContext.submitJob, so what really matters is not whether an action is blocking, but which SparkContext method is used to get non-blocking behaviour.

Please note that the built-in RDD action methods are already written, and their implementations use whichever method the Spark developers chose (mostly SparkContext.runJob, as in count):

// RDD.count
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

You'd have to write your own RDD actions (on a custom RDD) to get the required non-blocking behaviour in your Spark application.
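As a rough sketch of what a non-blocking submission could look like with SparkContext.submitJob (the per-partition sum is just an illustrative computation):

import scala.concurrent.Await
import scala.concurrent.duration.Duration

val rdd = sc.parallelize(1 to 1000, numSlices = 4)

// submitJob returns a FutureAction immediately instead of blocking like runJob.
val futureAction = sc.submitJob[Int, Int, Unit](
  rdd,
  (it: Iterator[Int]) => it.sum,                              // computation per partition
  0 until rdd.getNumPartitions,                               // which partitions to run on
  (index: Int, partSum: Int) => println(s"partition $index -> $partSum"),
  ()                                                          // result produced once all partitions finish
)

// The driver thread is free to do other work here; block only when the result is needed.
Await.result(futureAction, Duration.Inf)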

answered Oct 14 '22 by Jacek Laskowski