 

Launching Apache Spark SQL jobs from a multi-threaded driver

I want to pull data from about 1,500 remote Oracle tables with Spark, and I want a multi-threaded application that picks up one table per thread (or maybe 10 tables per thread) and launches a Spark job to read from the respective tables.

From the official Spark site, https://spark.apache.org/docs/latest/job-scheduling.html, it's clear that this can work:

...cluster managers that Spark runs on provide facilities for scheduling across applications. Second, within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads. This is common if your application is serving requests over the network. Spark includes a fair scheduler to schedule resources within each SparkContext.
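To make that concrete, here is a minimal sketch of the pattern those docs describe, assuming PySpark; the app name and the scheduler pool names are placeholders I made up:

```python
# Minimal sketch (PySpark assumed): jobs submitted from separate driver
# threads, scheduled by the FAIR scheduler within one SparkContext.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("multi-threaded-driver")          # hypothetical app name
         .config("spark.scheduler.mode", "FAIR")    # enable fair scheduling
         .getOrCreate())

def run_job(job_id):
    # Assign this thread's jobs to a scheduler pool (pool name is arbitrary;
    # undefined pools get default settings).
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", f"pool_{job_id}")
    return spark.range(0, 1_000_000).count()        # stand-in for a real action

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_job, range(4)))
print(results)
```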

However, you might have noticed in this SO post, Concurrent job Execution in Spark, that there was no accepted answer to this similar question, and that the most upvoted answer starts with:

This is not really in the spirit of Spark

  1. Everyone knows it's not in the "spirit" of Spark
  2. Who cares what the spirit of Spark is? That doesn't actually mean anything

Has anyone gotten something like this to work before? Did you have to do anything special? I just wanted some pointers before I waste a lot of work hours prototyping. I would really appreciate any help on this!

asked Nov 29 '22 by uh_big_mike_boi

2 Answers

The SparkContext is thread-safe, so it's possible to call it from many threads in parallel. (I am doing this in production.)

One thing to be aware of is to limit the number of threads you have running (see the sketch after this list), because:
1. the executor memory is shared between all threads, and you might hit OOM errors or constantly swap data in and out of the cache
2. the CPU is limited, so having more concurrent tasks than cores won't bring any improvement
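Here is a minimal sketch of that bounded-pool approach, assuming PySpark; the JDBC URL, credentials, table list, and output path are placeholders:

```python
# Minimal sketch: read many Oracle tables through a bounded pool of driver
# threads. The JDBC URL, credentials, and table names are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-pull").getOrCreate()

JDBC_URL = "jdbc:oracle:thin:@//db-host:1521/ORCL"    # placeholder URL
PROPS = {"user": "etl_user", "password": "secret",    # placeholder credentials
         "driver": "oracle.jdbc.OracleDriver"}

def copy_table(table_name):
    df = spark.read.jdbc(url=JDBC_URL, table=table_name, properties=PROPS)
    # Stand-in sink; write wherever your pipeline needs the data.
    df.write.mode("overwrite").parquet(f"/data/landing/{table_name}")
    return table_name

tables = [f"SCHEMA.TABLE_{i}" for i in range(1500)]   # hypothetical table list

# Keep the pool small so concurrent jobs don't exhaust executor memory/cores.
with ThreadPoolExecutor(max_workers=8) as pool:
    for done in pool.map(copy_table, tables):
        print("finished", done)
```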

answered Dec 05 '22 by lev

You do not need to submit your jobs from one multi-threaded application (although I see no reason you could not do so). Just submit your jobs as individual processes: have a script that submits them one at a time and pushes each process to the background, or submit in yarn-cluster mode. Your scheduler (YARN, Mesos, Spark standalone) will simply let some of your jobs wait when there is no room to run them all at the same time, based on memory and/or CPU availability.
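For illustration, here is a minimal sketch of that individual-process route, written in Python rather than a shell script; the queue name, the job script, and the table list are placeholders:

```python
# Minimal sketch: launch each table load as its own spark-submit process in
# YARN cluster mode and let the cluster manager queue whatever does not fit.
import subprocess

tables = ["SCHEMA.TABLE_A", "SCHEMA.TABLE_B"]   # placeholder table list

procs = []
for table in tables:
    cmd = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--queue", "etl",                        # hypothetical YARN queue
        "load_one_table.py",                     # hypothetical job script
        table,
    ]
    # Popen returns immediately, so all submissions go out back to back;
    # the cluster manager decides which applications actually run now.
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```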

Note that I only see a benefit to your approach if you truly process your tables using multiple partitions, not just one, as I have seen many times. Also, because you need to process that many tables, I am not sure how much, if at all, you will benefit. Depending on what you do with the table data, it might be simpler to just run multiple single-threaded, non-Spark jobs.

Also see @cowbert's note.

answered Dec 05 '22 by YoYo