I'm running a Spark job that takes as input the output generated by a previous run of the same job. Right now each run writes its results to HDFS for the next run to read in. Is there a way to cache the output of each job in Spark so that the following run won't have to read it from HDFS?
Update: or is it possible for Spark to share an RDD among different applications?
You can't achieve this directly. However, there are a few solutions that can help.
As @morfious902002 mentioned, you can use Alluxio (you'll need to install it on your cluster), which provides a layered storage tier (memory/HDFS/S3). Your job writes its output to an Alluxio path instead of a plain HDFS path, and the next run reads it back from memory when possible.
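A minimal sketch of that hand-off, assuming Alluxio is already deployed, the Alluxio client jar is on Spark's classpath, and `alluxio://alluxio-master:19998/jobs/latest-output` is a placeholder for your actual master address and path:

```scala
import org.apache.spark.sql.SparkSession

object AlluxioHandoff {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("alluxio-handoff")
      .getOrCreate()

    // End of one run: write the result to an Alluxio path instead of HDFS.
    // Alluxio keeps hot data in memory and spills to its configured
    // under-store (HDFS/S3), so the next run can read it without a full HDFS round trip.
    val result = spark.range(0, 1000000).toDF("id")
    result.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/jobs/latest-output")

    // Start of the next run: read the previous run's output back from Alluxio.
    val previous = spark.read.parquet("alluxio://alluxio-master:19998/jobs/latest-output")
    println(previous.count())

    spark.stop()
  }
}
```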
Another option is to use spark-jobserver (or a similar tool) that keeps a single long-lived SparkContext and lets you submit jobs to it via a REST API. Since all jobs run under the same long-lived context, you can share RDDs between them, as shown in the sketch below.
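As a rough illustration against the legacy spark-jobserver job API (the `SparkJob`/`NamedRddSupport` trait and method names come from older releases and may differ in your version; `"shared-output"` is just a placeholder key):

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// First job: computes an RDD and caches it under a name inside the shared context.
class ProducerJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val result = sc.parallelize(1 to 1000000).map(_ * 2)
    namedRdds.update("shared-output", result) // keep the RDD alive in the long-lived context
    result.count()
  }
}

// A later job submitted to the same context: looks the RDD up by name
// instead of re-reading it from HDFS.
class ConsumerJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val shared = namedRdds.get[Int]("shared-output")
      .getOrElse(sys.error("producer job has not run yet in this context"))
    shared.count()
  }
}
```

Both jobs are then submitted to the same context through the job server's REST API, so the RDD never leaves the executors' memory between runs.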
EDIT: Outdated
No, it's not possible to share an RDD between applications.
You'll have to persist it on disk or in a database.
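A minimal sketch of that disk hand-off, assuming the HDFS path is a placeholder and the target directory does not already exist when `saveAsObjectFile` runs:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsHandoff {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-handoff"))

    // End of one run: serialize the RDD so the next application can pick it up.
    val result = sc.parallelize(1 to 1000000).map(_.toLong)
    result.saveAsObjectFile("hdfs:///jobs/shared/latest-output")

    // Start of the next run: rebuild the RDD from the serialized objects.
    val previous = sc.objectFile[Long]("hdfs:///jobs/shared/latest-output")
    println(previous.count())

    sc.stop()
  }
}
```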