 

How to run two spark jobs in parallel in standalone mode [duplicate]

I have a Spark job in which I process a file and then do the following steps (sketched below):

1. Load the file into a DataFrame
2. Push the DataFrame to Elasticsearch
3. Run some aggregations on the DataFrame and save the result to Cassandra
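
Roughly, the steps look like this (I'm using the elasticsearch-hadoop and spark-cassandra-connector data sources; the index, keyspace, table and column names are just placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("es-cassandra-pipeline").getOrCreate();

// 1. Load the file into a DataFrame
Dataset<Row> df = spark.read().json("input/file.json");

// 2. Push the DataFrame to Elasticsearch via the elasticsearch-hadoop connector
df.write()
  .format("org.elasticsearch.spark.sql")
  .option("es.resource", "my_index/docs")
  .mode(SaveMode.Append)
  .save();

// 3. Aggregate and save the result to Cassandra via the spark-cassandra-connector
Dataset<Row> agg = df.groupBy("some_key").count();
agg.write()
   .format("org.apache.spark.sql.cassandra")
   .option("keyspace", "my_keyspace")
   .option("table", "my_table")
   .mode(SaveMode.Append)
   .save();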

I have written a Spark job for this with the following function calls:

writeToES(df)
writeToCassandra(df)

At the moment these two operations run one after the other, but they could run in parallel.

How can I do this in a single Spark job?

I could create two separate Spark jobs, one for writing to ES and one for Cassandra, but they would use multiple ports, which I want to avoid.

asked Apr 04 '18 08:04 by hard coder



1 Answer

You cannot run these two actions within the same Spark job. What you're really looking for is running these two jobs in parallel within the same application.

As the documentation says, you can run multiple jobs in parallel in the same application if those jobs are submitted from different threads:

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

In other words, this should run both actions in parallel (using the CompletableFuture API here, but you can use any async execution or multithreading mechanism):

CompletableFuture.runAsync(() -> writeToES(df));
CompletableFuture.runAsync(() -> writeToCassandra(df));

You can then join on one or both of these futures to wait for completion. As noted in the documentation, you also need to pay attention to the configured scheduler mode: using the FAIR scheduler allows you to run the above in parallel:

conf.set("spark.scheduler.mode", "FAIR")
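
For completeness, a sketch of waiting on both writes, assuming df, writeToES and writeToCassandra are the ones from the question (runAsync uses the common ForkJoinPool by default; the two-argument overload takes a dedicated Executor if you want your own pool):

import java.util.concurrent.CompletableFuture;

// Submit each action from its own thread so Spark can schedule both jobs concurrently
CompletableFuture<Void> esJob = CompletableFuture.runAsync(() -> writeToES(df));
CompletableFuture<Void> cassandraJob = CompletableFuture.runAsync(() -> writeToCassandra(df));

// Block until both writes finish; join() rethrows any failure as a CompletionException
CompletableFuture.allOf(esJob, cassandraJob).join();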
answered Sep 28 '22 06:09 by ernest_k