
Transformation process in Apache Spark

Transformations create a new RDD based on an existing RDD. RDDs are immutable, and all transformations in Spark are lazy. Data in RDDs is not processed until an action is performed, but without processing the data, how are new RDDs created? For example, in a filter operation, how is a new RDD created without actually loading the parent RDD into memory and processing it?

asked Feb 06 '23 by Prashant_J

2 Answers

Question: For example, in a filter operation, how is a new RDD created without actually loading the parent RDD into memory and processing it?

Transformation process in Apache Spark:

[Diagram: transformation process in Apache Spark]

For example:

val firstRDD = sc.textFile("hdfs://...")
val secondRDD = firstRDD.filter(someFunction)
val thirdRDD = secondRDD.map(someFunction)
val result = thirdRDD.count()

Since an RDD is built up through a sequence of transformations, Spark logs those transformations rather than the actual data (this is like an action plan of what needs to be done, e.g. "filter with this particular predicate"). The graph of the transformations needed to produce an RDD is called its lineage graph, like below.

The Spark RDD lineage graph for the above example would be:

[Lineage graph: textFile (HadoopRDD) -> filter -> map -> count]
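You can also print the lineage that Spark has recorded using RDD.toDebugString. A minimal sketch, assuming sc is the SparkContext and using placeholder lambdas in place of someFunction from the example above:

// Sketch: inspect the plan Spark has recorded, without reading any data.
// "hdfs://..." and the lambdas below are illustrative placeholders.
val firstRDD  = sc.textFile("hdfs://...")
val secondRDD = firstRDD.filter(line => line.contains("ERROR"))
val thirdRDD  = secondRDD.map(line => line.length)

// Nothing has been computed yet; toDebugString only prints the lineage,
// e.g. MapPartitionsRDD[2] <- MapPartitionsRDD[1] <- HadoopRDD[0]
println(thirdRDD.toDebugString)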

Please see RDD.scala: filter only creates a new RDD that records your predicate, which is like an action plan. This plan is executed only when you call an action such as count.

/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}
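To convince yourself that the predicate really is not evaluated when filter() is called, here is a small sketch using a LongAccumulator (all names are illustrative, not from the question):

// Sketch: count how many times the filter predicate actually runs.
val calls = sc.longAccumulator("predicate calls")
val nums  = sc.parallelize(1 to 5)

val evens = nums.filter { n => calls.add(1); n % 2 == 0 }

println(calls.value)    // 0 -- filter() only built a new MapPartitionsRDD
println(evens.count())  // 2 -- the action triggers the actual computation
println(calls.value)    // 5 -- the predicate ran once per element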
  • Lazy evaluation means that when we call a transformation on an RDD (for instance, calling map()), the operation is not immediately performed.
  • Instead, Spark internally records metadata to indicate that this operation has been requested. Rather than thinking of an RDD as containing specific data, it is best to think of each RDD as consisting of instructions on how to compute the data, built up through transformations.
  • Loading data into an RDD is lazily evaluated in the same way transformations are. So, when we call sc.textFile(), the data is not loaded until it is necessary. As with transformations, the operation (in this case, reading the data) can occur multiple times. A small demonstration of this is sketched below.
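For example (a sketch with a deliberately nonexistent, hypothetical path), the failure only shows up when an action forces the read:

// Loading is lazy too: sc.textFile() returns immediately even if the path
// is invalid; the error (typically a Hadoop InvalidInputException) surfaces
// only when an action first tries to read the data.
val missing = sc.textFile("hdfs://nonexistent/path")   // no error here

// missing.count()   // uncommenting this line is what would actually fail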

Lazy evaluation (correcting your quote "all transformations in Spark are lazy" to "all transformations in Spark are lazily evaluated"):

Spark computes RDDs lazily, the first time they are used in an action, so that it can pipeline transformations. So, in the above example, the RDDs are evaluated only when the count() action is invoked.

Hope that helps...

answered Feb 08 '23 by Ram Ghadiyaram


Spark transformations are lazy operations. They are not computed right away; Spark just remembers the transformations applied to the RDD and returns a pointer to the result. The transformations are computed only when an action is applied to them. Once an action is applied, Spark breaks the operations into tasks and distributes them across the nodes for execution, as sketched below.
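A minimal sketch of that behaviour (all names here are made up for illustration):

// Transformations return new RDDs immediately; no job is launched yet.
val data     = sc.parallelize(1 to 1000000, numSlices = 8) // 8 partitions
val squared  = data.map(x => x.toLong * x)                 // returns instantly
val filtered = squared.filter(_ % 2 == 0)                  // still no work done

// The action launches a job; Spark splits it into one task per partition
// (8 tasks here) and runs them on the executors.
println(filtered.count())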

answered Feb 08 '23 by Hokam