I'm submitting a Spark job (spark-submit).
Problem
I'm loading an RDD by reading Avro files from HDFS.
Then I filter the RDD and count it (job-1).
Then I filter it again using different criteria and count it (job-2).
Looking at rdd.toDebugString, I do not see the parent RDD being cached.
Details
Here is the code:
JavaRDD<Record> records = loadAllRecords();
JavaRDD<Record> type1Recs = records.filter(selectType1());
JavaRDD<Record> type2Recs = records.filter(selectType2());
log.info(type1Recs.count());
log.info(type2Recs.count());
When I look at the RDD debug info for the first count:
.....
.....
| MapPartitionsRDD[2] at filter at xxxx.java:61 []
| NewHadoopRDD[0] at newAPIHadoopRDD at xxxxx.java:64 []
When I look at the RDD debug info for the second count:
.....
.....
| MapPartitionsRDD[5] at filter at EventRepo.java:61 []
| NewHadoopRDD[0] at newAPIHadoopRDD at xxxxx.java:64 []
If I were caching the RDD, NewHadoopRDD would have some caching info associated with it in the debug string...
However, I do notice that in both instances the RDD is referred to as NewHadoopRDD[0]. What does the [0] mean in this context? Is it the RDD's id? I think of an RDD as a handle, so I'm not sure what the significance of reusing the same handle would be.
When I do the first count I see in the logs:
FileInputFormat: Total input paths to process : 60
But I do not see a similar log for the second count. Shouldn't the records RDD be loaded all over again?
Finally, the second count is faster than the first, which leads me to believe the data is in memory...
Does Spark cache RDDs automatically?
Sometimes, yes. When a stage involves a shuffle, Spark persists the intermediate shuffle files, so stages whose output has already been computed can be skipped on later jobs.
You might have observed "skipped stages" in the Spark Web UI, for example.
See: https://spark.apache.org/docs/1.5.0/programming-guide.html#shuffle-operations
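A minimal sketch of the shuffle case (class name, app name, and the sample data are made up for illustration): an RDD produced by reduceByKey is used in two actions, and the second job can skip the already-computed shuffle map stage even though cache() was never called.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class ShuffleReuseDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("shuffle-reuse").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, Integer> counts = sc
                .parallelize(Arrays.asList("a", "b", "a", "c"))
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey(Integer::sum);  // introduces a shuffle

            counts.count();  // job 1: runs the shuffle map stage and the result stage
            counts.count();  // job 2: the map stage shows up as "skipped" in the Web UI
        }
    }
}
```

Your two filter-then-count jobs have no shuffle, so this reuse does not apply to them, which is why NewHadoopRDD shows no caching info in the debug string.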
In the other cases, you will need to call rdd.cache() or one of its variants (persist() with an explicit StorageLevel).
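Applied to your pattern, a minimal local-mode sketch (the inlined strings and filter predicates stand in for your loadAllRecords() and selectType1()/selectType2(), which aren't shown in the question):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

import java.util.Arrays;

public class CacheDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> records = sc
                .parallelize(Arrays.asList("type1:a", "type2:b", "type1:c"))
                .persist(StorageLevel.MEMORY_ONLY());  // equivalent to cache()

            long type1 = records.filter(r -> r.startsWith("type1")).count();  // materializes the cache
            long type2 = records.filter(r -> r.startsWith("type2")).count();  // reads cached partitions

            // After the first action, toDebugString reports the cached partitions.
            System.out.println(records.toDebugString());

            records.unpersist();  // release the cached blocks when done
        }
    }
}
```

With the explicit persist(), the second count no longer re-reads the input, and you will see caching info (e.g. "CachedPartitions") attached to the parent RDD in the debug string.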