I have a Spark job in which I process a file and then do the following steps:
1. Load the file into a DataFrame
2. Push the DataFrame to Elasticsearch
3. Run some aggregations on the DataFrame and save the results to Cassandra
I have written a Spark job for this which makes the following function calls:
writeToES(df)
writeToCassandra(df)
Right now these two operations run one after the other, but they could run in parallel. How can I do this within a single Spark job?
I could create two Spark jobs, one for writing to ES and one for writing to Cassandra, but they would use multiple ports, which I want to avoid.
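For reference, the sequential version looks roughly like this (a sketch only; the input path and the bodies of writeToES/writeToCassandra stand in for my actual code):

SparkSession spark = SparkSession.builder().appName("my-job").getOrCreate();
Dataset<Row> df = spark.read().json("/path/to/file"); // 1. load the file
writeToES(df);                                        // 2. push to Elasticsearch
writeToCassandra(df);                                 // 3. aggregate and save to Cassandra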
A Spark stage can be understood as a block of computation over the partitions of a distributed collection, one that can execute in parallel across the nodes of a cluster. Spark builds the execution flow of an application out of one or more such stages.
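For instance, a single action triggers one job that Spark splits at the shuffle boundary into two stages, each running its tasks in parallel across the partitions (a hypothetical sketch; the column name and output path are illustrative):

Dataset<Row> counts = df.groupBy("userId").count(); // transformation only: no job yet
counts.write().parquet("/tmp/counts");              // action: one job, two stages (scan/map, then aggregate)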
If you really need to work with multiple Spark contexts, there is a special option, spark.driver.allowMultipleContexts, but it exists only for Spark's internal tests and is not supposed to be used in user programs; you will get unexpected behavior when running more than one Spark context in a single JVM [SPARK-2243].
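For completeness, the switch would be set like this (an illustration only; the option exists in Spark 1.x/2.x, was removed in 3.0, and is discouraged even where available):

SparkConf conf = new SparkConf()
        .setAppName("my-app")
        .set("spark.driver.allowMultipleContexts", "true"); // internal-test option, not for production use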
You cannot run these two actions as a single Spark job; what you are surely looking for is running the two jobs in parallel within the same application.
As the documentation says, you can run multiple jobs in parallel in the same application if those jobs are submitted from different threads:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
In other words, this should run both actions in parallel (using the CompletableFuture API here, but you can use any async-execution or multithreading mechanism):
CompletableFuture<Void> esFuture = CompletableFuture.runAsync(() -> writeToES(df));
CompletableFuture<Void> cassandraFuture = CompletableFuture.runAsync(() -> writeToCassandra(df));
You can then join on one or both of these futures to wait for completion. As the documentation also notes, pay attention to the configured scheduler mode: with the FAIR scheduler, concurrent jobs share executor resources in a round-robin fashion instead of queueing up FIFO, which lets both writes make progress at the same time:
conf.set("spark.scheduler.mode", "FAIR")
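Putting the pieces together, a minimal end-to-end sketch might look as follows (assuming a plain Java driver; the pool size and input path are illustrative, and writeToES/writeToCassandra are the helpers from the question):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParallelWrites {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("parallel-writes")
                .config("spark.scheduler.mode", "FAIR") // let concurrent jobs share executors
                .getOrCreate();

        Dataset<Row> df = spark.read().json("/path/to/input"); // placeholder input

        // runAsync defaults to ForkJoinPool.commonPool(); a dedicated pool is
        // safer for long-running, blocking write jobs.
        ExecutorService pool = Executors.newFixedThreadPool(2);

        CompletableFuture<Void> esFuture =
                CompletableFuture.runAsync(() -> writeToES(df), pool);
        CompletableFuture<Void> cassandraFuture =
                CompletableFuture.runAsync(() -> writeToCassandra(df), pool);

        // Block until both writes complete; join() rethrows any failure
        // wrapped in an unchecked CompletionException.
        CompletableFuture.allOf(esFuture, cassandraFuture).join();

        pool.shutdown();
        spark.stop();
    }

    private static void writeToES(Dataset<Row> df) { /* the asker's ES write */ }
    private static void writeToCassandra(Dataset<Row> df) { /* the asker's Cassandra write */ }
}

Note that the scheduler mode only affects how concurrently submitted jobs share resources; the parallelism itself comes from submitting the two actions on separate threads.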