 

Writing multiple DataFrames in parallel in Scala

I am doing a lot of calculations and am finally down to 3 DataFrames.

For example:

val mainQ = spark.sql("select * from employee")
mainQ.createOrReplaceTempView("mainQ")
val mainQ1 = spark.sql("select state, count(1) from mainQ group by state")
val mainQ2 = spark.sql("select dept_id, sum(salary) from mainQ group by dept_id")
val mainQ3 = spark.sql("select dept_id, state, sum(salary) from mainQ group by dept_id, state")

// Basically I want to perform the writes below in parallel. I could put them
// into different files, but that is not what I am looking for. Once all
// computation is done, I want to write the data in parallel.
mainQ1.write.mode("overwrite").save("/user/h/mainQ1.txt")
mainQ2.write.mode("overwrite").save("/user/h/mainQ2.txt")
mainQ3.write.mode("overwrite").save("/user/h/mainQ3.txt")
asked Feb 11 '26 by user3539924

1 Answer

Normally there is no benefit to using multi-threading in the driver code, but sometimes it can improve performance. I have had situations where launching parallel Spark jobs increased performance drastically, namely when the individual jobs do not utilize the cluster resources well (e.g. due to data skew, too few partitions, etc.). In your case you can do:

// On Scala 2.13+ this requires the scala-parallel-collections module
import scala.collection.parallel.immutable.ParSeq

ParSeq(
  (mainQ1, "/user/h/mainQ1.txt"),
  (mainQ2, "/user/h/mainQ2.txt"),
  (mainQ3, "/user/h/mainQ3.txt")
).foreach { case (df, filename) =>
  df.write.mode("overwrite").save(filename)
}
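An alternative to parallel collections is to launch each write as a `Future` and block until all of them complete. The sketch below is runnable on its own: `save` is a hypothetical stand-in for the blocking `df.write.mode("overwrite").save(path)` call, since a real Spark session is assumed to be unavailable here.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelWriteDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for df.write.mode("overwrite").save(path);
    // in a real job this would be the blocking Spark write call.
    def save(path: String): String = {
      Thread.sleep(50) // simulate I/O latency
      s"wrote $path"
    }

    val paths = Seq("/user/h/mainQ1.txt", "/user/h/mainQ2.txt", "/user/h/mainQ3.txt")

    // Launch all writes concurrently, then block until every one has finished.
    val results = Await.result(Future.sequence(paths.map(p => Future(save(p)))), 1.minute)
    results.foreach(println)
  }
}
```

`Future.sequence` preserves the input order of results even though the tasks run concurrently, so the writes can overlap while the driver still gets a deterministic summary at the end.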
answered Feb 13 '26 by Raphael Roth

