
Spark: cache RDD to be used in another job

I'm running a Spark job that takes as input the output generated by a previous run of the same job. Right now each run writes its results to HDFS for the next run to read back in. Is there a way to cache the output of each job in Spark so that the following run doesn't have to read from HDFS?
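For reference, the current round trip looks roughly like this (the HDFS path and the toy computation are placeholders, not the real job):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("job-n"))

    // Job N: compute a result and write it out for the next run.
    val result = sc.parallelize(1 to 1000).map(_ * 2)
    result.saveAsObjectFile("hdfs:///tmp/job-output")   // hypothetical path

    // Job N+1 (a separate application) has to read it back from HDFS.
    val previous = sc.objectFile[Int]("hdfs:///tmp/job-output")
    println(previous.count())

    sc.stop()
  }
}
```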

Update: alternatively, is it possible for Spark to share an RDD among different applications?

asked Dec 15 '22 by elgoog

2 Answers

You can't achieve this directly. However, there are a few solutions that will help you.

As @morfious902002 mentioned, you can use Alluxio (though you'll need to install it on your cluster), which provides a kind of layered storage (memory/HDFS/S3). Spark jobs write to and read from an `alluxio://` path just like any other filesystem, and data that fits in Alluxio's memory tier is served from memory on subsequent reads.
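A minimal sketch of that, assuming the Alluxio client jar is on Spark's classpath and the master runs on the default port (host names and paths here are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AlluxioShareExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("alluxio-share"))

    // Job 1 writes its output to Alluxio instead of plain HDFS.
    val result = sc.textFile("hdfs:///input/data").map(_.toUpperCase)
    result.saveAsTextFile("alluxio://alluxio-master:19998/shared/job-output")

    // A later job (possibly a different application) reads the same path;
    // if the data is still in Alluxio's memory tier, no HDFS read happens.
    val shared = sc.textFile("alluxio://alluxio-master:19998/shared/job-output")
    println(shared.count())

    sc.stop()
  }
}
```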

Another option would be to use spark-jobserver or something similar, which holds one long-lived Spark context and lets you submit jobs to it via a REST API, as sketched below. Since all jobs are executed under the same long-living context, you'll be able to share RDDs between jobs.
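A rough sketch of the named-RDD idea with spark-jobserver follows; the trait and method names match the older `NamedRddSupport` API and may differ between versions, so treat this as an assumption rather than a recipe:

```scala
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// First job: builds the RDD and registers it under a name in the
// long-lived context managed by the job server.
object ProduceJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val data = sc.parallelize(1 to 1000).map(_ * 2)
    namedRdds.update("shared-output", data)   // "shared-output" is a made-up name
    data.count()
  }
}

// Second job: submitted later to the same context over REST,
// picks the RDD up by name without touching HDFS.
object ConsumeJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val data = namedRdds.get[Int]("shared-output")
      .getOrElse(sys.error("shared-output not found"))
    data.sum()
  }
}
```

Both jobs would be submitted against the same pre-created context in the REST call so that they run in one JVM; the context and job names above are illustrative only.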

answered Jan 18 '23 by Igor Berman

EDIT: Outdated

No, it's not possible to share an RDD between applications.

You'll have to persist it on disk or in a database.

answered Jan 18 '23 by 3 revs