 

Writing multiple DataFrames in parallel in Scala

I am doing a lot of calculations and am finally down to 3 DataFrames.

For example:

val mainQ = spark.sql("select * from employee")
mainQ.createOrReplaceTempView("mainQ")
val mainQ1 = spark.sql("select state, count(1) from mainQ group by state")
val mainQ2 = spark.sql("select dept_id, sum(salary) from mainQ group by dept_id")
val mainQ3 = spark.sql("select dept_id, state, sum(salary) from mainQ group by dept_id, state")

// Basically I want to perform the writes below in parallel. I could put them
// into different files, but that is not what I am looking for. Once all
// computation is done, I want to write the data in parallel.
mainQ1.write.mode("overwrite").save("/user/h/mainQ1.txt")
mainQ2.write.mode("overwrite").save("/user/h/mainQ2.txt")
mainQ3.write.mode("overwrite").save("/user/h/mainQ3.txt")
asked Feb 11 '26 by user3539924

1 Answer

Normally there is no benefit to using multi-threading in the driver code, but sometimes it can improve performance. I have had situations where launching parallel Spark jobs increased performance drastically, namely when the individual jobs do not utilize the cluster resources well (e.g. due to data skew, too few partitions, etc.). In your case you can do:

// On Scala 2.13+ this requires the scala-parallel-collections module
import scala.collection.parallel.immutable.ParSeq

ParSeq(
  (mainQ1, "/user/h/mainQ1.txt"),
  (mainQ2, "/user/h/mainQ2.txt"),
  (mainQ3, "/user/h/mainQ3.txt")
).foreach { case (df, filename) =>
  df.write.mode("overwrite").save(filename)
}
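An alternative to parallel collections is to launch each write as a `Future` and block until all of them complete. The sketch below is runnable on its own: `save` is a hypothetical stand-in for the blocking `df.write.mode("overwrite").save(path)` call, since a real Spark session is assumed to be unavailable here.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelWriteDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for df.write.mode("overwrite").save(path);
    // in a real job this would be the blocking Spark write call.
    def save(path: String): String = {
      Thread.sleep(50) // simulate I/O latency
      s"wrote $path"
    }

    val paths = Seq("/user/h/mainQ1.txt", "/user/h/mainQ2.txt", "/user/h/mainQ3.txt")

    // Launch all writes concurrently, then block until every one has finished.
    val results = Await.result(Future.sequence(paths.map(p => Future(save(p)))), 1.minute)
    results.foreach(println)
  }
}
```

`Future.sequence` preserves the input order of results even though the tasks run concurrently, so the writes can overlap while the driver still gets a deterministic summary at the end.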
answered Feb 13 '26 by Raphael Roth

