 

How to share Spark RDD between 2 Spark contexts?

I have an RMI cluster. Each RMI server has a Spark context. Is there any way to share an RDD between different Spark contexts?

asked Jan 13 '15 by simafengyun

2 Answers

As already stated by Daniel Darabos, it is not possible. Every distributed object in Spark is bound to the specific context that was used to create it (a SparkContext in the case of an RDD, a SQLContext in the case of a DataFrame/Dataset). If you want to share objects between applications you have to use a shared context (see for example spark-jobserver, Livy, or Apache Zeppelin). Since an RDD or DataFrame is just a small local object describing a distributed computation, there is really not much to share.

Sharing data is a completely different problem. You can use a specialized in-memory cache (such as Apache Ignite) or a distributed in-memory file system (such as Alluxio, formerly Tachyon) to minimize the latency when switching between applications, but you cannot really avoid it.
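
A minimal sketch of that data-sharing idea with Alluxio, assuming an Alluxio master reachable at alluxio-master:19998 and the Alluxio client jar on the Spark classpath; the hostnames, paths, and application names here are illustrative, not part of the original answer:

```scala
import org.apache.spark.sql.SparkSession

// Application A: materialize its result to Alluxio-backed storage.
val sparkA = SparkSession.builder().appName("writer").getOrCreate()
sparkA.read.json("hdfs:///input/events.json")
  .write.mode("overwrite")
  .parquet("alluxio://alluxio-master:19998/shared/events")

// Application B (a separate driver with its own context): read the same
// files back as a brand-new DataFrame; Alluxio keeps the round trip cheap
// by serving the data from memory instead of disk.
val sparkB = SparkSession.builder().appName("reader").getOrCreate()
val shared = sparkB.read.parquet("alluxio://alluxio-master:19998/shared/events")
shared.show()
```

Each application still rebuilds its own DataFrame; what is shared is the underlying data, not the Spark object itself.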

answered Sep 21 '22 by zero323

No, an RDD is tied to a single SparkContext. The general idea is that you have a Spark cluster and one driver program that tells the cluster what to do. This driver would have the SparkContext and kick off operations on the RDDs.

If you want to just move an RDD from one driver program to another, the solution is to write it to disk (S3/HDFS/...) in the first driver and load it from disk in the other driver.
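
A minimal sketch of that write-then-load approach using the Scala RDD API, assuming a shared HDFS path; the path, object names, and element type are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Driver program 1: persist the RDD's data to shared storage.
object Producer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("producer"))
    val rdd = sc.parallelize(1 to 1000000)
    rdd.saveAsObjectFile("hdfs:///shared/my-rdd")
    sc.stop()
  }
}

// Driver program 2: a separate application with its own SparkContext
// rebuilds an equivalent RDD from the files written above.
object Consumer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("consumer"))
    val restored = sc.objectFile[Int]("hdfs:///shared/my-rdd")
    println(restored.count())
    sc.stop()
  }
}
```

The second driver gets a new RDD backed by the same data, not the original RDD; its lineage starts at the files on shared storage.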

answered Sep 23 '22 by Daniel Darabos