I want to pull data from about 1500 remote Oracle tables with Spark, and I'd like a multi-threaded application that picks up a table per thread (or maybe 10 tables per thread) and launches a Spark job to read from its respective tables.
From the official Spark docs at https://spark.apache.org/docs/latest/job-scheduling.html it's clear that this can work...
...cluster managers that Spark runs on provide facilities for scheduling across applications. Second, within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads. This is common if your application is serving requests over the network. Spark includes a fair scheduler to schedule resources within each SparkContext.
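For reference, the fair scheduler mentioned there is just a config switch on the session; a minimal sketch of what I'd expect to set (the app name and pool name are my own placeholders):

```python
from pyspark.sql import SparkSession

# FAIR scheduling lets jobs submitted from different driver threads
# share executors instead of running strictly FIFO.
spark = (SparkSession.builder
         .appName("oracle-ingest")  # placeholder name
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

# Optionally pin the current thread's jobs to a named scheduler pool
# ("ingest" is an arbitrary pool name).
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "ingest")
```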
However, you might have noticed in the SO post Concurrent job Execution in Spark that this similar question has no accepted answer, and its most upvoted answer starts with
This is not really in the spirit of Spark
Has anyone gotten something like this to work before? Did you have to do anything special? I just want some pointers before I waste a lot of work hours prototyping. I would really appreciate any help on this!
The SparkContext is thread-safe, so it's possible to call it from many threads in parallel. (I am doing this in production.)
One thing to be aware of is to limit the number of threads you have running, because:
1. executor memory is shared between all the concurrent jobs, so you might hit OOM errors or constantly swap data in and out of the cache
2. CPU is limited, so having more tasks in flight than you have cores won't bring any improvement
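Here is a minimal sketch of the pattern, assuming a PySpark driver with a bounded thread pool; the JDBC URL, credentials, table list, and output path are all placeholders:

```python
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-ingest").getOrCreate()

# Placeholder Oracle connection details -- substitute your own.
JDBC_URL = "jdbc:oracle:thin:@//dbhost:1521/SERVICE"
PROPS = {"user": "scott", "password": "tiger",
         "driver": "oracle.jdbc.OracleDriver"}

def copy_table(table):
    # Each call runs on its own driver thread; the SparkContext is
    # shared and thread-safe, so the resulting jobs run concurrently.
    df = spark.read.jdbc(JDBC_URL, table, properties=PROPS)
    df.write.mode("overwrite").parquet("/data/raw/" + table)

tables = ["EMPLOYEES", "ORDERS"]  # in practice, your ~1500 table names

# Bound the pool: more in-flight jobs than the cluster has cores and
# memory for will only queue up or thrash the cache (points 1 and 2).
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(copy_table, tables))
```

For large tables you would also pass column, lowerBound, upperBound and numPartitions to read.jdbc so each individual table read is itself split into multiple partitions.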
You do not need to submit your jobs from one multithreaded application (although I see no reason you could not do so). Just submit your jobs as individual processes: have a script that submits them one at a time and pushes each process to the background, or submit in yarn-cluster mode. Your scheduler (YARN, Mesos, Spark standalone) will simply make some of your jobs wait whenever there is not enough memory and/or CPU available for all of them to run at the same time.
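A sketch of that launcher approach, again with placeholder names (copy_table.py is a hypothetical per-table job script; the spark-submit flags are the standard YARN cluster-mode ones):

```python
import subprocess

tables = ["EMPLOYEES", "ORDERS"]  # in practice, load your ~1500 names

procs = []
for table in tables:
    # One spark-submit per table; cluster deploy mode hands the driver
    # to YARN, so this launcher only waits for submission to complete.
    procs.append(subprocess.Popen([
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "copy_table.py",  # hypothetical job script taking a table name
        table,
    ]))

# YARN then queues the jobs and runs them as memory/CPU becomes free.
for p in procs:
    p.wait()
```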
Note that I only see a benefit to your approach if you truly read each table using multiple partitions, not just one, as I have seen many times. Also, since you need to process that many tables, I am not sure how much, if at all, you will benefit. Depending on what you do with the table data, it might be simpler to run several single-threaded, non-Spark jobs instead.
Also see @cowbert's note.