
What is the result of RDD transformation in Spark?

Can anyone explain what the result of an RDD transformation is? Is it a new set of data (a copy of the data), or is it only a new set of pointers to filtered blocks of the old data?

asked Feb 14 '15 by Speise

2 Answers

RDD transformations allow you to create dependencies between RDDs. The dependencies are just the steps for producing the results (a program). Each RDD in the lineage chain (the string of dependencies) has a function for computing its data and a pointer (dependency) to its parent RDD. Spark divides the RDD dependencies into stages and tasks and sends those to workers for execution.

So if you do this:

val lines = sc.textFile("...")
val words = lines.flatMap(line => line.split(" "))
val localwords = words.collect()

words will be an RDD containing a reference to the lines RDD. When the program is executed, lines' function is executed first (loading the data from the text file), then words' function is executed on the result (splitting the lines into words). Spark is lazy, so nothing is executed until you call an action that triggers job creation and execution (collect in this example).

So an RDD (a transformed RDD, too) is not 'a set of data' but a step in a program (possibly the only step) telling Spark how to get the data and what to do with it.
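You can see this lineage without triggering any execution: the RDD's toDebugString method prints the dependency chain Spark has recorded. A minimal sketch, reusing the lines and words values from the example above (the exact output format varies by Spark version):

// Prints the recorded lineage; this call does not run a job
println(words.toDebugString)
// The output shows words (the flatMap step) pointing back to
// its parent, the RDD created by textFile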

answered Oct 05 '22 by pzecevic

Transformations create a new RDD based on an existing RDD; RDDs themselves are immutable. All transformations in Spark are lazy: the data in an RDD is not processed until an action is performed.

Examples of RDD transformations: map, filter, flatMap, groupByKey, reduceByKey
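A minimal sketch chaining two of these transformations and then an action (assumes a SparkContext named sc; the input data is made up for illustration):

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)       // transformation: nothing runs yet
val bigOnes = doubled.filter(_ > 4) // transformation: still nothing runs
val result = bigOnes.collect()      // action: triggers the actual computation
// result: Array(6, 8, 10)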

answered Oct 05 '22 by Ramana