
Spark: cache RDD to be used in another job

I'm running a Spark job that takes as input the output generated by a previous run of the same job. Right now each run writes its results to HDFS for the next run to read back in. Is there a way to cache the output of each job in Spark so that the following run doesn't have to read from HDFS?
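For reference, the current round trip looks roughly like this (the HDFS path and the toy computation are placeholders, not the real job):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-n"))

    // Job N: compute a result and write it out for the next run.
    val result = sc.parallelize(1 to 1000).map(_ * 2)
    result.saveAsObjectFile("hdfs:///tmp/job-output")   // hypothetical path

    // Job N+1 (a separate application) has to read it back from HDFS.
    val previous = sc.objectFile[Int]("hdfs:///tmp/job-output")
    println(previous.count())

    sc.stop()
  }
}
```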

Update: alternatively, is it possible for Spark to share an RDD among different applications?

asked Dec 15 '22 by elgoog

2 Answers

You can't achieve this directly. However, there are a few solutions that will help you.

As @morfious902002 mentioned, you can use Alluxio (though you'll need to install it on your cluster), which provides a kind of layered storage (memory/HDFS/S3). Spark jobs write to and read from an `alluxio://` path just like any other filesystem, and data that fits in Alluxio's memory tier is served from memory on subsequent reads.
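A minimal sketch of that, assuming the Alluxio client jar is on Spark's classpath and the master runs on the default port (host names and paths here are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AlluxioShareExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("alluxio-share"))

    // Job 1 writes its output to Alluxio instead of plain HDFS.
    val result = sc.textFile("hdfs:///input/data").map(_.toUpperCase)
    result.saveAsTextFile("alluxio://alluxio-master:19998/shared/job-output")

    // A later job (possibly a different application) reads the same path;
    // if the data is still in Alluxio's memory tier, no HDFS read happens.
    val shared = sc.textFile("alluxio://alluxio-master:19998/shared/job-output")
    println(shared.count())

    sc.stop()
  }
}
```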

Another option would be to use spark-jobserver or something similar, which holds one long-lived Spark context and lets you submit jobs to it via a REST API, as sketched below. Since all jobs are executed under the same long-living context, you'll be able to share RDDs between jobs.
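A rough sketch of the named-RDD idea with spark-jobserver follows; the trait and method names match the older `NamedRddSupport` API and may differ between versions, so treat this as an assumption rather than a recipe:

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// First job: builds the RDD and registers it under a name in the
// long-lived context managed by the job server.
object ProduceJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val data = sc.parallelize(1 to 1000).map(_ * 2)
    namedRdds.update("shared-output", data)   // "shared-output" is a made-up name
    data.count()
  }
}

// Second job: submitted later to the same context over REST,
// picks the RDD up by name without touching HDFS.
object ConsumeJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val data = namedRdds.get[Int]("shared-output")
      .getOrElse(sys.error("shared-output not found"))
    data.sum()
  }
}
```

Both jobs would be submitted against the same pre-created context in the REST call so that they run in one JVM; the context and job names above are illustrative only.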

answered Jan 18 '23 by Igor Berman

EDIT: Outdated

No, it's not possible to share an RDD between applications.

You'll have to persist it on disk or in a database.

answered Jan 18 '23 by 3 revs