
Is there a way to change the replication factor of RDDs in Spark?

From what I understand, there are multiple copies of data in RDDs in the cluster, so that if a node fails, the program can recover. However, in cases where the chance of failure is negligible, it would be costly memory-wise to have multiple copies of data in the RDDs. So, my question is: is there a parameter in Spark which can be used to reduce the replication factor of the RDDs?

asked Jul 25 '15 by MetallicPriest

2 Answers

First, note that Spark does not automatically cache all your RDDs, simply because applications may create many RDDs, and not all of them are meant to be reused. You have to call .persist() or .cache() on them.

You can set the storage level with which you want to persist an RDD with myRDD.persist(StorageLevel.MEMORY_AND_DISK). .cache() is a shorthand for .persist(StorageLevel.MEMORY_ONLY).
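For instance, here's a minimal sketch; the SparkContext sc and the input path are assumed for illustration:

    import org.apache.spark.storage.StorageLevel

    // `sc` is an existing SparkContext; the path is hypothetical.
    val rdd = sc.textFile("hdfs:///data/events.log")

    // Explicit storage level; rdd.cache() would be shorthand for
    // rdd.persist(StorageLevel.MEMORY_ONLY).
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    // persist() is lazy: the first action materializes and caches the data.
    rdd.count()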

The default storage level for persist is indeed StorageLevel.MEMORY_ONLY for an RDD in Java or Scala, but it usually differs if you are creating a DStream (refer to the relevant DStream constructor's API doc). If you're using Python, the default is StorageLevel.MEMORY_ONLY_SER.

The doc details a number of storage levels and what they mean, but they're fundamentally a configuration shorthand for an instance of the StorageLevel class. You can thus define your own, with a replication factor of your choosing (Spark caps replication below 40).
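For example, a sketch of building a custom level through StorageLevel's factory method; the variable names are assumptions:

    import org.apache.spark.storage.StorageLevel

    // Arguments: useDisk, useMemory, useOffHeap, deserialized, replication.
    // Here: memory-and-disk, deserialized, three replicas of each partition.
    val memDiskThreeCopies = StorageLevel(true, true, false, true, 3)

    myRDD.persist(memDiskThreeCopies)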

Note that of the various predefined storage levels, some keep a single copy of the RDD. In fact, that's true of all of those whose name isn't suffixed with _2 (except NONE):

  • DISK_ONLY
  • MEMORY_ONLY
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_SER
  • OFF_HEAP

That's one copy per medium they employ, of course; if you want a single copy overall, you have to choose a single-medium storage level.
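For instance, assuming two existing RDDs a and b (an RDD's storage level can only be set once):

    import org.apache.spark.storage.StorageLevel

    a.persist(StorageLevel.MEMORY_ONLY)   // one copy, in memory only
    b.persist(StorageLevel.MEMORY_ONLY_2) // two copies, in memory only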

answered Oct 13 '22 by Francois G


As huitseeker said, unless you specifically ask Spark to persist an RDD and specify a StorageLevel that uses replication, it won't keep multiple copies of an RDD's partitions.

What Spark does do is keep a lineage of how a specific piece of data was computed, so that when a node fails it only repeats the processing needed to rebuild the lost RDD partitions. In my experience this mostly works, though on occasion it is faster to restart the job than to let it recover.
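As a quick illustration, toDebugString prints the lineage Spark keeps around; the SparkContext sc, the path, and the transformations here are hypothetical:

    // `sc` is an existing SparkContext; the path is made up.
    val lengths = sc.textFile("hdfs:///logs/app.log")
      .map(_.length)
      .filter(_ > 0)

    // Prints the chain of transformations Spark would replay to
    // recompute any lost partitions of `lengths`.
    println(lengths.toDebugString)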

answered Oct 13 '22 by Arnon Rotem-Gal-Oz