I create an RDD using flatMap. Later on I apply range partitioning to it. If I persist the original RDD, everything works fine. However, if I don't cache it, the range partitioning step somehow wants to recompute parts of the original RDD. I would understand this if I didn't have enough memory, but in this case there is much more memory in my system than the RDD occupies. Secondly, the computations for that RDD are long, so this restarting/recomputing really hurts performance. What could be the reason for this strange behavior?
P.S. I use the RDD only once, so this should not happen.
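A minimal sketch of the kind of pipeline I mean; the input file, keys, and partition count are placeholders, not my real job:

import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("range-partition-sketch").setMaster("local[*]"))

// Stand-in for the expensive flatMap; the real computation takes much longer.
val pairs: RDD[(Int, String)] =
  sc.textFile("input.txt")                                        // placeholder input
    .flatMap(line => line.split("\\s+").map(w => (w.length, w)))

// pairs.cache()  // with this line everything is fine; without it,
//                // parts of `pairs` get recomputed during range partitioning

val partitioner = new RangePartitioner(8, pairs)
val partitioned = pairs.partitionBy(partitioner)
partitioned.count()                                               // any action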
This is just how Spark works:
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).
So when you don't, it doesn't. If you use an RDD more than once, and have enough memory, you generally want to persist it.
This can't be done automatically, because Spark can't know whether you are going to reuse the RDD: e.g. you can compute an RDD, then sample it, and use the result to decide whether you want to do something else with the RDD, so whether the RDD is used twice depends on the random number generator.
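A hedged illustration of that point (assuming an existing SparkContext sc; the parsing logic, file names, and threshold are made up):

// Stand-in for a long computation.
def expensiveParse(line: String): Seq[String] = line.split(",").toSeq

val data = sc.textFile("events.txt").flatMap(expensiveParse)      // placeholder input
val sample = data.takeSample(withReplacement = false, num = 100)  // first use of `data`

if (sample.count(_.nonEmpty) > 50) {
  // `data` is used a second time only on this branch, so Spark cannot
  // know up front whether persisting it would have paid off
  data.saveAsTextFile("parsed-output")                            // placeholder output
}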
If you don't use RDD.cache, the result of computing an RDD is not persisted in memory. For example, suppose there is an RDD called rdd_test:
val rdd_test: RDD[Int] = sc.makeRDD(Array(1,2,3), 1)
val a = rdd_test.map(_+1)
val b = a.map(_+1)
Now rdd_test, a, and b (these three RDDs) are not held in memory. So when we define val c = b.map(_+1) and run an action on c, a and b will be recomputed.
If we use cache on a and b:
val rdd_test: RDD[Int] = sc.makeRDD(Array(1,2,3), 1)
val a = rdd_test.map(_+1).cache
val b = a.map(_+1).cache
Then, when we define val c = b.map(_+1) and run an action on c, a and b will not be recomputed.
(Please note: if there is not enough memory, the cache method will fail to keep the partitions in memory, so a and b will still be recomputed.)
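If memory pressure is a concern, one option (a sketch using the standard StorageLevel API) is to persist with a level that can spill to disk, so partitions are written out instead of being dropped and recomputed:

import org.apache.spark.storage.StorageLevel

val a = rdd_test.map(_ + 1).persist(StorageLevel.MEMORY_AND_DISK)  // spills to disk if memory is tight
val b = a.map(_ + 1).persist(StorageLevel.MEMORY_AND_DISK)
val c = b.map(_ + 1)  // `a` and `b` are read back from memory or disk, not recomputed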
I'm not good at English, sorry.