In my Spark application, I am trying to read multiple tables from an RDBMS, do some data processing, and then write multiple tables to another RDBMS, as follows (in Scala):
val reading1 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable1))
val reading2 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable2))
val reading3 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable3))
// data processing
// ..............
myDF1.write.mode("append").jdbc(myurl2, outtable1, new java.util.Properties)
myDF2.write.mode("append").jdbc(myurl2, outtable2, new java.util.Properties)
myDF3.write.mode("append").jdbc(myurl2, outtable3, new java.util.Properties)
I understand that reading from a single table can be parallelized using partitions. However, the read operations for reading1, reading2, and reading3 appear to run sequentially, as do the write operations for myDF1, myDF2, and myDF3.
How can I read from multiple tables (mytable1, mytable2, mytable3) in parallel, and also write to multiple tables in parallel (I assume the same logic applies)?
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action.
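For example, one way to apply this to the writes above is to submit each save from its own thread using Scala Futures. This is only a minimal sketch: the thread-pool size of 3 is an arbitrary choice, and the DataFrames and connection details are the ones from the question.

import java.util.Properties
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Dedicated pool for submitting the jobs concurrently (size 3 is illustrative).
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(3))

// Each write is a Spark action; submitting them from separate threads
// lets the scheduler run the resulting jobs concurrently.
val writes = Seq(
  Future { myDF1.write.mode("append").jdbc(myurl2, outtable1, new Properties) },
  Future { myDF2.write.mode("append").jdbc(myurl2, outtable2, new Properties) },
  Future { myDF3.write.mode("append").jdbc(myurl2, outtable3, new Properties) }
)

// Block until all three writes have finished.
Await.result(Future.sequence(writes), Duration.Inf)

Note that the JDBC loads themselves are lazy; the actual table scans happen when the corresponding actions run, so parallelizing the actions is also what gives you parallel reads.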
Spark SQL reads the columns of a columnar format such as Parquet independently. For example, if a DataFrame has six columns, each column is stored as its own column chunk, but by default all six chunks end up in a single Parquet file.
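As a rough illustration (the path and column names below are hypothetical), writing one of the DataFrames to Parquet and then selecting a subset of columns shows the columnar layout in action:

// Hypothetical output path, purely for illustration.
myDF1.write.mode("overwrite").parquet("/tmp/mytable1_parquet")

// Selecting two (hypothetical) columns lets Spark read only those
// column chunks from the Parquet file instead of every column.
val twoCols = sqlContext.read.parquet("/tmp/mytable1_parquet").select("col1", "col2")
twoCols.show()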
By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program.
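For the case where a variable should be shared rather than copied per task, Spark provides broadcast variables. A minimal sketch, assuming an existing SparkContext named sc and an illustrative lookup map:

// Lookup data the tasks need; the contents are illustrative only.
val countryNames = Map("US" -> "United States", "DE" -> "Germany")

// Ship the map to each executor once instead of once per task.
val bcNames = sc.broadcast(countryNames)

val expanded = sc.parallelize(Seq("US", "DE", "US"))
  .map(code => bcNames.value.getOrElse(code, "unknown"))
  .collect()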
You can set the scheduler mode to FAIR; it should then run the jobs in parallel. https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Scheduling Within an Application Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
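When jobs are submitted from separate threads, each thread can also be assigned to its own fair-scheduler pool via sc.setLocalProperty. A sketch using the DataFrames from the question (the pool names are illustrative):

// Each thread submits its own write job under a named fair-scheduler pool.
new Thread {
  override def run(): Unit = {
    sc.setLocalProperty("spark.scheduler.pool", "pool1")
    myDF1.write.mode("append").jdbc(myurl2, outtable1, new java.util.Properties)
  }
}.start()

new Thread {
  override def run(): Unit = {
    sc.setLocalProperty("spark.scheduler.pool", "pool2")
    myDF2.write.mode("append").jdbc(myurl2, outtable2, new java.util.Properties)
  }
}.start()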