
Is there a way to change the replication factor of RDDs in Spark?

From what I understand, there are multiple copies of data in RDDs in the cluster, so that if a node fails, the program can recover. However, in cases where the chance of failure is negligible, it would be costly memory-wise to have multiple copies of data in the RDDs. So, my question is: is there a parameter in Spark which can be used to reduce the replication factor of the RDDs?

asked Jul 25 '15 by MetallicPriest

2 Answers

First, note that Spark does not automatically cache all your RDDs, simply because applications may create many RDDs, and not all of them are meant to be reused. You have to call .persist() or .cache() on them.

You can set the storage level with which you want to persist an RDD with myRDD.persist(StorageLevel.MEMORY_AND_DISK). .cache() is a shorthand for .persist(StorageLevel.MEMORY_ONLY).
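For instance, here's a minimal sketch; the SparkContext sc and the input path are assumed for illustration:

    import org.apache.spark.storage.StorageLevel

    // `sc` is an existing SparkContext; the path is hypothetical.
    val rdd = sc.textFile("hdfs:///data/events.log")

    // Explicit storage level; rdd.cache() would be shorthand for
    // rdd.persist(StorageLevel.MEMORY_ONLY).
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    // persist() is lazy: the first action materializes and caches the data.
    rdd.count()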

The default storage level for persist is indeed StorageLevel.MEMORY_ONLY for an RDD in Java or Scala, but it usually differs if you are creating a DStream (refer to the relevant DStream constructor's API doc). If you're using Python, the default is StorageLevel.MEMORY_ONLY_SER.

The doc details a number of storage levels and what they mean, but they're fundamentally a configuration shorthand for an instance of the StorageLevel class. You can thus define your own, with a replication factor of your choosing (Spark caps replication below 40).
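For example, a sketch of building a custom level through StorageLevel's factory method; the variable names are assumptions:

    import org.apache.spark.storage.StorageLevel

    // Arguments: useDisk, useMemory, useOffHeap, deserialized, replication.
    // Here: memory-and-disk, deserialized, three replicas of each partition.
    val memDiskThreeCopies = StorageLevel(true, true, false, true, 3)

    myRDD.persist(memDiskThreeCopies)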

Note that of the various predefined storage levels, some keep a single copy of the RDD. In fact, that's true of all of those whose name isn't suffixed with _2 (except NONE):

  • DISK_ONLY
  • MEMORY_ONLY
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_SER
  • OFF_HEAP

That's one copy per medium they employ, of course; if you want a single copy overall, you have to choose a single-medium storage level.
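For instance, assuming two existing RDDs a and b (an RDD's storage level can only be set once):

    import org.apache.spark.storage.StorageLevel

    a.persist(StorageLevel.MEMORY_ONLY)   // one copy, in memory only
    b.persist(StorageLevel.MEMORY_ONLY_2) // two copies, in memory only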

answered Oct 13 '22 by Francois G


As huitseeker said, unless you specifically ask Spark to persist an RDD and specify a StorageLevel that uses replication, it won't keep multiple copies of an RDD's partitions.

What Spark does do is keep a lineage of how a specific piece of data was computed, so that when a node fails it only repeats the processing needed to rebuild the lost RDD partitions. In my experience this mostly works, though on occasion it is faster to restart the job than to let it recover.
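As a quick illustration, toDebugString prints the lineage Spark keeps around; the SparkContext sc, the path, and the transformations here are hypothetical:

    // `sc` is an existing SparkContext; the path is made up.
    val lengths = sc.textFile("hdfs:///logs/app.log")
      .map(_.length)
      .filter(_ > 0)

    // Prints the chain of transformations Spark would replay to
    // recompute any lost partitions of `lengths`.
    println(lengths.toDebugString)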

answered Oct 13 '22 by Arnon Rotem-Gal-Oz