Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark mapPartitions vs transient lazy val

I was wondering what is the different of using sparks mapPartitions functionality vs transient lazy val.
Since each partition is basically running on a different node a single instance of transient lazy val will be created on each nod (assuming its in an object).

For example:

class NotSerializable(v: Int) {
  def foo(a: Int) = ???
}

object OnePerPartition {
  @transient lazy val obj: NotSerializable = new NotSerializable(10)
}

object Test extends App{
    val conf = new SparkConf().setMaster("local[2]").setAppName("test")
    val sc = new SparkContext(conf)

    val rdd: RDD[Int] = sc.parallelize(1 to 100000)

    rdd.map(OnePerPartition.obj.foo)

    // ---------- VS ----------

    rdd.mapPartitions(itr => {
      val obj = new NotSerializable(10)
      itr.map(obj.foo)
    })
}

One might ask why would you even want it...
I would like to create a general container notion for running my logic on any generic collection implementation (RDD, List, scalding pipe, etc.)
All of them have a notion of "map", but mapPartition is unique for spark.

like image 947
Noam Shaish Avatar asked Nov 23 '16 20:11

Noam Shaish


1 Answers

First of all you don't need transient lazy here. Using object wrapper is enough to make this work and you can actually write this as:

object OnePerExecutor {
  val obj: NotSerializable = new NotSerializable(10)
}

There is a fundamental difference between the object wrapper and initializing NotSerializable inside mapPartitions. This:

rdd.mapPartitions(iter => {
  val ns = NotSerializable(1)
  ???
})

creates a single NotSerializable instance per partition.

Object wrapper from the other hand, creates a single NotSerializable instance for each executor JVM. As a result this instance:

  • Can be used to process multiple partitions.
  • Can be accessed simultaneously by multiple executor threads.
  • Has lifespan exceeding function call where it is used.

It means it should be thread safe and any method calls should be side effects free.

like image 190
zero323 Avatar answered Oct 06 '22 23:10

zero323