I'm submitting a Spark job (spark-submit).
Problem
I'm loading an RDD by reading Avro files from HDFS.
Then I filter the RDD and count it (job-1).
Then I filter it again using different criteria and count it (job-2).
Looking at rdd.toDebugString, I do not see the parent RDD being cached.
Details
Here is the code:
JavaRDD<Record> records = loadAllRecords();
JavaRDD<Record> type1Recs = records.filter(selectType1());
JavaRDD<Record> type2Recs = records.filter(selectType2());
log.info(type1Recs.count());
log.info(type2Recs.count());
When I look at the RDD debug info for the first count:
.....
.....
| MapPartitionsRDD[2] at filter at xxxx.java:61 []
| NewHadoopRDD[0] at newAPIHadoopRDD at xxxxx.java:64 []
When I look at the RDD debug info for the second count:
.....
.....
| MapPartitionsRDD[5] at filter at EventRepo.java:61 []
| NewHadoopRDD[0] at newAPIHadoopRDD at xxxxx.java:64 []
If I were caching the RDD, NewHadoopRDD would have some caching info associated with it in the debug string...
However, I do notice that in both instances the RDD is referred to as NewHadoopRDD[0]. What does the [0] mean in this context? Is it the RDD's id? I think of an RDD as a handle, so I'm not sure what the significance of reusing the same handle would be.
When I do the first count I see in the logs:
FileInputFormat: Total input paths to process : 60
But I do not see a similar log for the second count. Shouldn't the records RDD be loaded all over again?
Finally, the second count is faster than the first, which leads me to believe the data is in memory...
Does Spark cache RDDs automatically?
Sometimes, yes. When a stage involves a shuffle, Spark persists the intermediate shuffle files, so stages whose output has already been computed can be skipped on later jobs.
You might have observed "skipped stages" in the Spark Web UI, for example.
See: https://spark.apache.org/docs/1.5.0/programming-guide.html#shuffle-operations
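A minimal sketch of the shuffle case (class name, app name, and the sample data are made up for illustration): an RDD produced by reduceByKey is used in two actions, and the second job can skip the already-computed shuffle map stage even though cache() was never called.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class ShuffleReuseDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("shuffle-reuse").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, Integer> counts = sc
                .parallelize(Arrays.asList("a", "b", "a", "c"))
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey(Integer::sum);  // introduces a shuffle

            counts.count();  // job 1: runs the shuffle map stage and the result stage
            counts.count();  // job 2: the map stage shows up as "skipped" in the Web UI
        }
    }
}
```

Your two filter-then-count jobs have no shuffle, so this reuse does not apply to them, which is why NewHadoopRDD shows no caching info in the debug string.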
In the other cases, you will need to call rdd.cache() or one of its variants (persist() with an explicit StorageLevel).
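Applied to your pattern, a minimal local-mode sketch (the inlined strings and filter predicates stand in for your loadAllRecords() and selectType1()/selectType2(), which aren't shown in the question):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

import java.util.Arrays;

public class CacheDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> records = sc
                .parallelize(Arrays.asList("type1:a", "type2:b", "type1:c"))
                .persist(StorageLevel.MEMORY_ONLY());  // equivalent to cache()

            long type1 = records.filter(r -> r.startsWith("type1")).count();  // materializes the cache
            long type2 = records.filter(r -> r.startsWith("type2")).count();  // reads cached partitions

            // After the first action, toDebugString reports the cached partitions.
            System.out.println(records.toDebugString());

            records.unpersist();  // release the cached blocks when done
        }
    }
}
```

With the explicit persist(), the second count no longer re-reads the input, and you will see caching info (e.g. "CachedPartitions") attached to the parent RDD in the debug string.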