Recently I had scenario to store the the data in keyValue Pair and came across a function reduceByKey(_ ++ _)
. This is more of shorthand syntax. I am not able to understand what this actually means.
Ex: reduceBykey(_ + _)
means reduceByKey((a,b)=>(a+b))
So reduceByKey(_ ++ _)
means ??
I am able to create Key value pair out of data using reduceByKey(_ ++ _)
.
val y = sc.textFile("file:///root/My_Spark_learning/reduced.txt")
y.map(value=>value.split(","))
.map(value=>(value(0),value(1),value(2)))
.collect
.foreach(println)
(1,2,3)
(1,3,4)
(4,5,6)
(7,8,9)
y.map(value=>value.split(","))
.map(value=>(value(0),Seq(value(1),value(2))))
.reduceByKey(_ ++ _)
.collect
.foreach(println)
(1,List(2, 3, 3, 4))
(4,List(5, 6))
(7,List(8, 9))
In Spark, the reduceByKey function is a frequently used transformation operation that performs aggregation of data. It receives key-value pairs (K, V) as an input, aggregates the values based on the key and generates a dataset of (K, V) pairs as an output.
It merges data locally before sending data across partitions for optimize data shuffling. It uses associative reduce function, where it merges value of each key. It can be used with Rdd only in key value pair. It's wide operation which shuffles data from multiple partitions/divisions and creates another RDD.
In this article, you have learned Spark RDD reduceByKey() transformation is used to merge the values of each key using an associative reduce function and learned it is a wider transformation that shuffles the data across RDD partitions.
reduceByKey(_ ++ _)
translates to reduceByKey((a,b) => a ++ b)
.
++
is a method defined on List
that concatenates another list to it.
So, for key 1 in the sample data, a
will be List(2,3)
and b
will be List(3,4)
and hence the concatenation of List(2,3)
and List(3,4)
(List(2,3) ++ List(3,4)
) would yield List(2,3,3,4)
.
reduceByKey(_ ++ _)
is equivalent to reduceByKey((x,y)=> x ++ y)
reduceByKey
takes two parameters, apply a function and returns
At the first it crates a set and ++
just adds collections together, combining elements of both sets.
For each key It keeps appending in the list. In your case of 1 as a key x will be List(2,3)
and y will List (3,4)
and ++
will add both as List (2,3,3,4)
If you had another value like (1,4,5)
then the x would be List(4,5)
in this case and y should be List (2,3,3,4)
and result would be List(2,3,3,4,4,5)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With