Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the meaning for reduceByKey(_ ++ _)

Recently I had scenario to store the the data in keyValue Pair and came across a function reduceByKey(_ ++ _) . This is more of shorthand syntax. I am not able to understand what this actually means.

Ex: reduceBykey(_ + _) means reduceByKey((a,b)=>(a+b))

So reduceByKey(_ ++ _) means ??

I am able to create Key value pair out of data using reduceByKey(_ ++ _).

val y = sc.textFile("file:///root/My_Spark_learning/reduced.txt")

y.map(value=>value.split(","))
  .map(value=>(value(0),value(1),value(2)))
  .collect
  .foreach(println)

(1,2,3)
(1,3,4)
(4,5,6)
(7,8,9)

y.map(value=>value.split(","))
  .map(value=>(value(0),Seq(value(1),value(2))))
  .reduceByKey(_ ++ _)
  .collect
  .foreach(println)

(1,List(2, 3, 3, 4))
(4,List(5, 6))
(7,List(8, 9))
like image 809
Rohan Nayak Avatar asked May 23 '17 05:05

Rohan Nayak


People also ask

What does reduceByKey do in spark?

In Spark, the reduceByKey function is a frequently used transformation operation that performs aggregation of data. It receives key-value pairs (K, V) as an input, aggregates the values based on the key and generates a dataset of (K, V) pairs as an output.

When to use reduce by key?

It merges data locally before sending data across partitions for optimize data shuffling. It uses associative reduce function, where it merges value of each key. It can be used with Rdd only in key value pair. It's wide operation which shuffles data from multiple partitions/divisions and creates another RDD.

Is reduce by key wide transformation?

In this article, you have learned Spark RDD reduceByKey() transformation is used to merge the values of each key using an associative reduce function and learned it is a wider transformation that shuffles the data across RDD partitions.


Video Answer


2 Answers

reduceByKey(_ ++ _) translates to reduceByKey((a,b) => a ++ b).

++ is a method defined on List that concatenates another list to it.

So, for key 1 in the sample data, a will be List(2,3) and b will be List(3,4) and hence the concatenation of List(2,3) and List(3,4) (List(2,3) ++ List(3,4)) would yield List(2,3,3,4).

like image 81
rogue-one Avatar answered Oct 17 '22 05:10

rogue-one


reduceByKey(_ ++ _) is equivalent to reduceByKey((x,y)=> x ++ y) reduceByKey takes two parameters, apply a function and returns

At the first it crates a set and ++ just adds collections together, combining elements of both sets.

For each key It keeps appending in the list. In your case of 1 as a key x will be List(2,3) and y will List (3,4) and ++ will add both as List (2,3,3,4)

If you had another value like (1,4,5) then the x would be List(4,5) in this case and y should be List (2,3,3,4) and result would be List(2,3,3,4,4,5)

like image 44
koiralo Avatar answered Oct 17 '22 04:10

koiralo