
What is the result of RDD transformation in Spark?

Can anyone explain what the result of an RDD transformation is? Is it a new set of data (a copy of the data), or is it only a new set of pointers to filtered blocks of the old data?

asked Feb 14 '15 by Speise

2 Answers

RDD transformations allow you to create dependencies between RDDs. The dependencies are just the steps for producing the results (a program). Each RDD in the lineage chain (the string of dependencies) has a function for computing its data and a pointer (dependency) to its parent RDD. Spark divides the RDD dependencies into stages and tasks and sends those to workers for execution.

So if you do this:

val lines = sc.textFile("...")
val words = lines.flatMap(line => line.split(" "))
val localwords = words.collect()

words will be an RDD containing a reference to the lines RDD. When the program is executed, lines' function is executed first (loading the data from the text file), then words' function is executed on the result (splitting the lines into words). Spark is lazy, so nothing is executed until you call an action that triggers job creation and execution (collect in this example).

So an RDD (a transformed RDD, too) is not 'a set of data' but a step in a program (possibly the only step) telling Spark how to get the data and what to do with it.
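You can see this lineage without triggering any execution: the RDD's toDebugString method prints the dependency chain Spark has recorded. A minimal sketch, reusing the lines and words values from the example above (the exact output format varies by Spark version):

// Prints the recorded lineage; this call does not run a job
println(words.toDebugString)
// The output shows words (the flatMap step) pointing back to
// its parent, the RDD created by textFile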

answered Oct 05 '22 by pzecevic

Transformations create a new RDD based on an existing RDD; RDDs themselves are immutable. All transformations in Spark are lazy: the data in an RDD is not processed until an action is performed.

Examples of RDD transformations: map, filter, flatMap, groupByKey, reduceByKey
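A minimal sketch chaining two of these transformations and then an action (assumes a SparkContext named sc; the input data is made up for illustration):

val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(_ * 2)       // transformation: nothing runs yet
val bigOnes = doubled.filter(_ > 4) // transformation: still nothing runs
val result = bigOnes.collect()      // action: triggers the actual computation
// result: Array(6, 8, 10)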

answered Oct 05 '22 by Ramana