
Does spark cache rdds automatically?

I'm submitting a spark job (spark-submit).

Problem

I'm loading an rdd by reading avro files from HDFS.
Then I filter the rdd & count it (job-1).
Then I filter it again using a different criteria and count it (job-2).

  • In the logs I see that the FileInputFormat is reading 60 files the first time. But it doesn't read any files the second time.
  • Also when I do rdd.toDebugString I do not see the parent rdd being cached.

Details

Here is the code:

JavaRDD<Record> records = loadAllRecords();
JavaRDD<Record> type1Recs = records.filter(selectType1());
JavaRDD<Record> type2Recs = records.filter(selectType2());
log.info("Type 1 count: " + type1Recs.count());
log.info("Type 2 count: " + type2Recs.count());
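For comparison, if the goal were to reuse the loaded records across both counts, the parent RDD could be cached explicitly before either action runs. A minimal self-contained sketch (using a small in-memory stand-in for the Avro load, since `loadAllRecords()` is not shown):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-example").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Stand-in for loadAllRecords(): a tiny RDD of tagged strings.
            JavaRDD<String> records = sc.parallelize(
                    Arrays.asList("type1:a", "type2:b", "type1:c")).cache();

            JavaRDD<String> type1Recs = records.filter(r -> r.startsWith("type1"));
            JavaRDD<String> type2Recs = records.filter(r -> r.startsWith("type2"));

            // The first action materializes and caches the parent partitions;
            // the second action filters them from memory instead of re-reading.
            System.out.println("Type 1 count: " + type1Recs.count());
            System.out.println("Type 2 count: " + type2Recs.count());
        }
    }
}
```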

When I look at the rdd debug info for the first count:

  .....
  .....
  |   MapPartitionsRDD[2] at filter at xxxx.java:61 []
  |   NewHadoopRDD[0] at newAPIHadoopRDD at xxxxx.java:64 []

When I look at the rdd debug info for the second count:

  .....
  .....
  |   MapPartitionsRDD[5] at filter at EventRepo.java:61 []
  |   NewHadoopRDD[0] at newAPIHadoopRDD at xxxxx.java:64 [] 

If I were caching, NewHadoopRDD would have some caching info associated with it in the debug string...

However, I do notice that in both instances the RDD is referred to as NewHadoopRDD[0]. What does the [0] mean in this context? Is it the RDD's id? I think of an RDD as a handle, so I'm not sure what the significance of reusing the same handle would be.

When I do the first count I see in the logs:

FileInputFormat: Total input paths to process : 60

But I do not see a similar log for the second count. Shouldn't the records RDD be loaded all over again?

Finally, the second count is faster than the first, which leads me to believe the data is in memory...

asked Apr 13 '26 by hba

1 Answer

Does spark cache rdds automatically?

Sometimes, yes. RDDs are automatically persisted when a shuffle occurs: the map-side shuffle output is written to disk, and later jobs that depend on the same shuffle can reuse those files instead of recomputing the upstream stages.

You might have observed "skipped stages" in the Spark Web UI, for example.

See: https://spark.apache.org/docs/1.5.0/programming-guide.html#shuffle-operations
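As a sketch of that shuffle case (a hypothetical word-count style job, not the asker's code; the input path is made up):

```java
// Both actions below depend on the same shuffle (the reduceByKey).
// The second action can skip the map-side stage and reuse the shuffle
// files written by the first, even without an explicit cache().
JavaPairRDD<String, Integer> counts = sc.textFile("hdfs:///some/input")
        .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);

long totalWords = counts.count();                            // runs both stages
long frequentWords = counts.filter(t -> t._2 > 10).count();  // map stage shows as "skipped"
```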

In other cases, you will need to call rdd.cache() (or persist() with a storage level) explicitly.
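A sketch of the explicit call and the kind of lineage it produces. Once the RDD is persisted and an action has run, toDebugString reports the storage level and cached partitions (the output in the comment is illustrative, not an exact log):

```java
import org.apache.spark.storage.StorageLevel;

records.persist(StorageLevel.MEMORY_ONLY());

// After the first action runs, toDebugString mentions the storage
// level and the cached partitions, along the lines of:
//
//   (60) MapPartitionsRDD[2] at filter at ... [Memory Deserialized 1x Replicated]
//    |        CachedPartitions: 60; MemorySize: ...; DiskSize: 0.0 B
//    |   NewHadoopRDD[0] at newAPIHadoopRDD at ... [Memory Deserialized 1x Replicated]
log.info(records.toDebugString());
```

That annotation is what the asker's debug output was missing, which is consistent with the parent RDD not being cached.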

answered Apr 16 '26 by axiom