I create an RDD using flatMap. Later on I apply range partitioning to it. If I persist the original RDD, everything works fine. However, if I don't cache it, the range partitioning step somehow wants to recompute parts of the original RDD. I would understand this if I didn't have enough memory, but in this case there is much more memory in my system than the RDD occupies. Secondly, the computations for that RDD are long, so this restarting/recomputing really hurts performance. What could be the reason for this strange behavior?
P.S. I use the RDD only once, so this should not happen.
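A minimal sketch of the kind of pipeline I mean; the input file, keys, and partition count are placeholders, not my real job:

import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("range-partition-sketch").setMaster("local[*]"))

// Stand-in for the expensive flatMap; the real computation takes much longer.
val pairs: RDD[(Int, String)] =
  sc.textFile("input.txt")                                        // placeholder input
    .flatMap(line => line.split("\\s+").map(w => (w.length, w)))

// pairs.cache()  // with this line everything is fine; without it,
//                // parts of `pairs` get recomputed during range partitioning

val partitioner = new RangePartitioner(8, pairs)
val partitioned = pairs.partitionBy(partitioner)
partitioned.count()                                               // any action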
This is just how Spark works:
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).
So when you don't, it doesn't. If you use an RDD more than once, and have enough memory, you generally want to persist it.
This can't be done automatically, because Spark can't know whether you are going to reuse the RDD: e.g. you can compute an RDD, then sample it, and use the result to decide whether you want to do something else with the RDD, so whether the RDD is used twice depends on the random number generator.
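A hedged illustration of that point (assuming an existing SparkContext sc; the parsing logic, file names, and threshold are made up):

// Stand-in for a long computation.
def expensiveParse(line: String): Seq[String] = line.split(",").toSeq

val data = sc.textFile("events.txt").flatMap(expensiveParse)      // placeholder input
val sample = data.takeSample(withReplacement = false, num = 100)  // first use of `data`

if (sample.count(_.nonEmpty) > 50) {
  // `data` is used a second time only on this branch, so Spark cannot
  // know up front whether persisting it would have paid off
  data.saveAsTextFile("parsed-output")                            // placeholder output
}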
If you don't use RDD.cache, the result of computing an RDD is not persisted in memory. For example, suppose there is an RDD called rdd_test:
val rdd_test: RDD[Int] = sc.makeRDD(Array(1,2,3), 1)
val a = rdd_test.map(_+1)
val b = a.map(_+1)
Now rdd_test, a, and b (these three RDDs) are not held in memory. So when we define val c = b.map(_+1) and run an action on c, a and b will be recomputed.
If we use cache on a and b:
val rdd_test: RDD[Int] = sc.makeRDD(Array(1,2,3), 1)
val a = rdd_test.map(_+1).cache
val b = a.map(_+1).cache
Then, when we define val c = b.map(_+1) and run an action on c, a and b will not be recomputed.
(Please note: if there is not enough memory, the cache method will fail to keep the partitions in memory, so a and b will still be recomputed.)
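If memory pressure is a concern, one option (a sketch using the standard StorageLevel API) is to persist with a level that can spill to disk, so partitions are written out instead of being dropped and recomputed:

import org.apache.spark.storage.StorageLevel

val a = rdd_test.map(_ + 1).persist(StorageLevel.MEMORY_AND_DISK)  // spills to disk if memory is tight
val b = a.map(_ + 1).persist(StorageLevel.MEMORY_AND_DISK)
val c = b.map(_ + 1)  // `a` and `b` are read back from memory or disk, not recomputed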
I'm not good at English, sorry.