I've recently started learning Spark and I'm confused about transformations and actions. I've read the Spark documentation and some books about Spark, and I know that an action causes a Spark job to be executed on the cluster while a transformation does not. But the RDD operations listed in Spark's API docs don't state whether each one is a transformation or an action.
For example, reduce is an action, whereas reduceByKey is a transformation! Why is that?
Transformations create new RDDs from existing ones, but when we want to work with the actual dataset, an action is performed. Unlike a transformation, an action does not produce a new RDD when it is triggered. Thus, actions are Spark RDD operations that return non-RDD values.
It's an optimization and performance concern, and cannot be classified as either an action or a transformation.
In Spark, the role of a transformation is to create a new dataset from an existing one. Transformations are lazy: they are only computed when an action requires a result to be returned to the driver program. Let's see some of the frequently used RDD transformations.
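To make the laziness concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's API: the LazyDataset class and the double helper are hypothetical names made up for illustration. Transformations only record a step and hand back a new dataset, and nothing runs until an action such as collect is called.

```python
from typing import Any, Callable, Iterable, List


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, NOT Spark's API)."""

    def __init__(self, source: Iterable[Any], steps: tuple = ()):
        self._source = source
        self._steps = steps  # recorded transformations, not yet run

    # Transformations: record the step and return a NEW dataset.
    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Actions: run every recorded step and return a plain Python value.
    def collect(self) -> List[Any]:
        items = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def count(self) -> int:
        return len(self.collect())


calls = []  # records every element the mapped function actually sees


def double(x):
    calls.append(x)
    return x * 2


numbers = LazyDataset([1, 2, 3, 4])
doubled = numbers.map(double).filter(lambda x: x > 4)  # nothing runs yet
calls_before_action = list(calls)  # snapshot taken before any action
result = doubled.collect()  # the action triggers the recorded steps
```

In this sketch calls_before_action is empty because double is never invoked while the chain is being built; it only runs once collect is called. Real Spark behaves similarly in spirit, but it also plans and distributes the work across the cluster, which this model ignores.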
Transformations are functions that apply to RDDs and produce other RDDs as output (e.g. map, flatMap, filter, join, groupBy, ...). Actions are functions that apply to RDDs and produce non-RDD output such as an Array or a List (e.g. count, saveAsTextFile, foreach, collect, ...).
You can tell by looking at the return type. An action will return a non-RDD type (usually your stored value types), whereas a transformation will return an RDD[Type], as it is still just a representation of your computation.
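The return-type rule also explains the reduce versus reduceByKey case from the question. Below is a rough sketch in plain Python; ToyRDD and reduce_by_key are hypothetical names for illustration, not Spark's API. Note that this toy reduce_by_key runs eagerly, so it only models the return types, not Spark's lazy execution.

```python
from functools import reduce as fold
from typing import Any, Callable, Dict, List


class ToyRDD:
    """Hypothetical in-memory stand-in for an RDD, for illustration only."""

    def __init__(self, items: List[Any]):
        self._items = list(items)

    def reduce(self, fn: Callable[[Any, Any], Any]) -> Any:
        # Collapses everything into ONE plain value: behaves like an action.
        return fold(fn, self._items)

    def reduce_by_key(self, fn: Callable[[Any, Any], Any]) -> "ToyRDD":
        # One result PER KEY, so the output is still a dataset:
        # behaves like a transformation (returns another ToyRDD).
        merged: Dict[Any, Any] = {}
        for key, value in self._items:
            merged[key] = fn(merged[key], value) if key in merged else value
        return ToyRDD(list(merged.items()))

    def collect(self) -> List[Any]:
        return list(self._items)


total = ToyRDD([1, 2, 3, 4]).reduce(lambda a, b: a + b)
per_key = ToyRDD([("a", 1), ("b", 2), ("a", 3)]).reduce_by_key(lambda a, b: a + b)
```

Here total is a plain int, while per_key is another ToyRDD that still needs an action such as collect to get the values out. The per-key result can be as large as the number of distinct keys, which is presumably why Spark keeps it as an RDD rather than pulling it back to the driver.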