From what I understand, there are multiple copies of data in RDDs in the cluster, so that in case of failure of a node, the program can recover. However, in cases where the chance of failure is negligible, it would be costly memory-wise to have multiple copies of the data in the RDDs. So, my question is: is there a parameter in Spark that can be used to reduce the replication factor of the RDDs?
First, note that Spark does not automatically cache all your RDDs, simply because applications may create many RDDs and not all of them are meant to be reused. You have to call .persist() or .cache() on them.

You can set the storage level with which you want to persist an RDD with myRDD.persist(StorageLevel.MEMORY_AND_DISK). .cache() is shorthand for .persist(StorageLevel.MEMORY_ONLY).
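As a minimal sketch of what explicit caching looks like (the input path and RDD name are placeholders chosen for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist-sketch").setMaster("local[*]"))

    val myRDD = sc.textFile("data.txt") // hypothetical input file

    // Persist explicitly with the storage level you want ...
    myRDD.persist(StorageLevel.MEMORY_AND_DISK)
    // ... or use the shorthand, which is equivalent to persist(StorageLevel.MEMORY_ONLY):
    // myRDD.cache()

    myRDD.count() // the first action materializes and caches the partitions
    myRDD.count() // later actions reuse the cached partitions

    myRDD.unpersist() // release the cached data when you're done with it
    sc.stop()
  }
}
```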
The default storage level for persist is indeed StorageLevel.MEMORY_ONLY for an RDD in Java or Scala, but it usually differs if you are creating a DStream (refer to your DStream constructor's API doc). If you're using Python, it's StorageLevel.MEMORY_ONLY_SER.
The doc details a number of storage levels and what they mean, but fundamentally they are a configuration shorthand to point Spark to an object which extends the StorageLevel class. You can thus define your own, with a replication factor of up to 40.
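As a sketch of what that can look like, the StorageLevel companion object provides a factory taking a replication count, so a level that mirrors MEMORY_AND_DISK but keeps three replicas per partition could be built like this (the value name is just an illustration):

```scala
import org.apache.spark.storage.StorageLevel

// Arguments are (useDisk, useMemory, useOffHeap, deserialized, replication);
// this mirrors MEMORY_AND_DISK but keeps 3 copies of each cached partition.
val MEMORY_AND_DISK_3 = StorageLevel(true, true, false, true, 3)

// Then persist with it as usual (myRDD is assumed to be an existing RDD):
// myRDD.persist(MEMORY_AND_DISK_3)
```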
Note that, of the various predefined storage levels, some keep only a single copy of the RDD. In fact, that's true of all of those whose name isn't suffixed with _2 (except NONE). That's one copy per medium they employ, of course; if you want a single copy overall, you have to choose a single-medium storage level.
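To make that concrete, each StorageLevel exposes its replication count, so you can check the predefined levels directly (a small sketch):

```scala
import org.apache.spark.storage.StorageLevel

// Levels without the _2 suffix keep a single copy; the _2 variants keep two.
println(StorageLevel.MEMORY_ONLY.replication)       // 1
println(StorageLevel.MEMORY_AND_DISK.replication)   // 1
println(StorageLevel.MEMORY_AND_DISK_2.replication) // 2
```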
As huitseeker said, unless you specifically ask Spark to persist an RDD and specify a StorageLevel that uses replication, it won't keep multiple copies of the RDD's partitions.

What Spark does do is keep a lineage of how a specific piece of data was calculated, so that when/if a node fails it only repeats the processing needed to rebuild the lost RDD partitions. In my experience this mostly works, though on occasion it is faster to restart the job than to let it recover.
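If you want to see that lineage for yourself, the RDD API includes toDebugString, which prints the chain of transformations Spark would replay to rebuild lost partitions. A minimal sketch, assuming a hypothetical input file:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))

val wordCounts = sc.textFile("data.txt")      // hypothetical input
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Shows the recorded lineage: textFile -> flatMap -> map -> reduceByKey
println(wordCounts.toDebugString)
```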