Spark Scala Understanding reduceByKey(_ + _)

I can't understand reduceByKey(_ + _) in this first Spark-with-Scala word-count example:

object WordCount {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)
    val outputPath = args(1)
    val sc = new SparkContext()
    val lines = sc.textFile(inputPath)
    val wordCounts = lines.flatMap { line => line.split(" ") }
      .map(word => (word, 1))
      .reduceByKey(_ + _)  // <-- I can't understand this line
    wordCounts.saveAsTextFile(outputPath)
  }
}
Elsayed asked May 01 '16

People also ask

How does reduceByKey work in spark?

In Spark, the reduceByKey function is a frequently used transformation that performs aggregation of data. It receives key-value pairs (K, V) as input, aggregates the values by key, and generates a dataset of (K, V) pairs as output.
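For intuition, the per-key aggregation reduceByKey performs can be emulated on a plain Scala collection, no Spark cluster needed (a local sketch, not the Spark API itself: groupBy plus reduce plays the role of the pair RDD machinery):

```scala
// Emulating reduceByKey(_ + _) on a local collection of (key, value) pairs.
// A real pair RDD would distribute this work across partitions.
val pairs = Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1))

// Group the values by key, then reduce each group's values with `_ + _`.
val reduced: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ + _) }

// reduced == Map("a" -> 3, "b" -> 2)
```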

What is the difference between groupByKey and reduceByKey?

Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle operation. The key difference is that reduceByKey does a map-side combine and groupByKey does not.
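To see why the map-side combine matters, here is a local sketch (the two hand-made "partitions" are an assumption purely for illustration): reduceByKey pre-aggregates inside each partition, so fewer records have to cross the shuffle than groupByKey would send.

```scala
// Two hypothetical partitions of (word, 1) pairs.
val partitions = Seq(
  Seq(("a", 1), ("a", 1), ("b", 1)), // partition 0
  Seq(("a", 1), ("b", 1), ("b", 1))  // partition 1
)

// reduceByKey-style: combine locally inside each partition first...
val combined = partitions.map { p =>
  p.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }.toSeq
}

// ...then only the small per-partition results are merged after the shuffle.
val merged = combined.flatten.groupBy(_._1)
  .map { case (k, kvs) => k -> kvs.map(_._2).sum }

// groupByKey would shuffle all 6 raw pairs; the local combine shrinks that to 4.
val rawRecords      = partitions.map(_.size).sum // 6
val combinedRecords = combined.map(_.size).sum   // 4
```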

What is the difference between reduce and reduceByKey?

Basically, reduce must pull the entire dataset down into a single location because it is reducing to one final value. reduceByKey, on the other hand, produces one value for each key. Since that per-key aggregation can run on each machine locally first, the result can remain an RDD and have further transformations applied to it.

Can we use reduceByKey in spark Dataframe?

Not directly: reduceByKey is an operation on pair RDDs (resilient distributed datasets), not on DataFrames. A local list can be parallelized into an RDD with Spark, after which reduceByKey() aggregates its values by key and the output can be displayed.


2 Answers

Reduce takes two elements and produces a third by applying a function to the two parameters.

The code you showed is equivalent to the following:

 reduceByKey((x, y) => x + y)

Instead of defining dummy variables and writing a lambda, Scala is smart enough to figure out that what you're trying to achieve is applying a function (sum in this case) to any two parameters it receives, hence the syntax

 reduceByKey(_ + _) 
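The underscore form is pure Scala shorthand, nothing Spark-specific; each underscore stands for one parameter, in order of appearance, and the three definitions below are interchangeable:

```scala
// Three equivalent ways to write a two-argument sum in Scala.
val short: (Int, Int) => Int  = _ + _
val lambda: (Int, Int) => Int = (x, y) => x + y
def method(x: Int, y: Int): Int = x + y

// All three sum their two arguments:
// short(2, 3) == lambda(2, 3) == method(2, 3) == 5
```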
Sleiman Jneidi answered Oct 10 '22

The function given to reduceByKey takes two values, applies an operation to them, and returns a single result.

reduceByKey(_ + _) is equivalent to reduceByKey((x, y) => x + y)

Example:

val numbers = Array(1, 2, 3, 4, 5)
val sum = numbers.reduceLeft[Int](_+_)

println("The sum of the numbers one through five is " + sum)

Results:

The sum of the numbers one through five is 15
numbers: Array[Int] = Array(1, 2, 3, 4, 5)
sum: Int = 15

Similarly, reduceByKey(_ ++ _) is equivalent to reduceByKey((x, y) => x ++ y).
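For example, when the values are lists, `_ ++ _` concatenates them, so the per-key result is one merged list (again a local sketch emulating the pair RDD behavior):

```scala
val pairs = Seq(("a", List(1)), ("b", List(2)), ("a", List(3)))

// `_ ++ _` concatenates the two collections it is given.
val merged = pairs.groupBy(_._1)
  .map { case (k, kvs) => k -> kvs.map(_._2).reduce(_ ++ _) }
// merged == Map("a" -> List(1, 3), "b" -> List(2))
```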

vaquar khan answered Oct 10 '22