Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read and write multiple tables in parallel in Spark?

In my Spark application, I am trying to read multiple tables from RDBMS, doing some data processing, then write multiple tables to another RDBMS as follows (in Scala):

val reading1 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable1))
val reading2 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable2))
val reading3 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable3))

// data processing
// ..............

myDF1.write.mode("append").jdbc(myurl2, outtable1, new java.util.Properties)
myDF2.write.mode("append").jdbc(myurl2, outtable2, new java.util.Properties)
myDF3.write.mode("append").jdbc(myurl2, outtable3, new java.util.Properties)

I understand that reading from one table can be paralleled using partitions. However, the read operations of reading1, reading2, reading3 seem sequential, so do the write operations of myDF1, myDF2, myDF3.

How can I read from the multiple tables (mytable1, mytable2, mytable3) in parallel? and also write to multiple tables in parallel (I think same logic)?

like image 798
wdz Avatar asked Aug 24 '15 22:08

wdz


People also ask

How do I run multiple Spark jobs in parallel?

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save , collect ) and any tasks that need to run to evaluate that action.

Does Spark read in parallel?

Spark SQL will read different column family in parallel. By default, for example, in this example, there are six columns for the Parquet file. By default, all the six columns will be in a single Parquet file.

Does Spark write in parallel?

By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program.


1 Answers

You can schedule mode to be FAIR, it should run the tasks in parallel. https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application

Scheduling Within an Application Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
like image 200
Dmitry Avatar answered Oct 27 '22 00:10

Dmitry