
Difference between reduce and reduceByKey in Apache Spark

Tags:

apache-spark

What is the difference between reduce and reduceByKey in Apache Spark in terms of functionality? Why is reduceByKey a transformation while reduce is an action?

user1326784 asked Dec 22 '17 01:12


People also ask

What is the difference between reduceByKey and groupByKey in spark?

reduceByKey aggregates values by key within each partition before shuffling, whereas groupByKey shuffles every key-value pair. On large datasets the difference is significant.

Which is better reduceByKey or groupByKey?

Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle. The key difference is that reduceByKey performs a map-side combine and groupByKey does not.
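The effect of the map-side combine can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark code: the `partitions` list and both helper functions are hypothetical stand-ins that only count how many records would cross the shuffle boundary under each strategy.

```python
from collections import Counter

# Hypothetical stand-in for a pair RDD: a list of partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

def records_shuffled_group_by_key(parts):
    """groupByKey-style: every pair is sent across the shuffle as-is."""
    return sum(len(part) for part in parts)

def records_shuffled_reduce_by_key(parts):
    """reduceByKey-style: values are summed per key inside each partition first,
    so only one record per distinct key per partition is sent."""
    shuffled = 0
    for part in parts:
        local = Counter()
        for key, value in part:
            local[key] += value  # map-side combine (sum used for illustration)
        shuffled += len(local)
    return shuffled

group_cost = records_shuffled_group_by_key(partitions)
reduce_cost = records_shuffled_reduce_by_key(partitions)
print(group_cost, reduce_cost)
```

In this toy model the gap grows with the number of repeated keys per partition, which is why the difference matters most on large datasets.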

What is spark reduceByKey?

In Spark, reduceByKey is a frequently used transformation that aggregates data. It takes key-value pairs (K, V) as input, merges the values for each key, and produces a dataset of (K, V) pairs as output.

What is reduceByKey in PySpark?

The PySpark reduceByKey() transformation merges the values of each key using an associative reduce function on a PySpark RDD. It is a wide transformation because it shuffles data across partitions, and it operates on pair RDDs (key/value pairs).


1 Answer

This is close to a duplicate of my answer explaining reduceByKey, but I will elaborate on the specific part that makes the two different. Refer to that answer for more detail on the internals of reduceByKey.

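Basically, reduce must bring its result down to a single location (the driver) because it reduces the whole dataset to one final value. Each partition is reduced locally first and the partial results are then merged, but the outcome is a plain value rather than an RDD, which is why reduce is an action. reduceByKey, on the other hand, produces one value for each key. Since that reduction can run locally on each machine first and the per-key results stay distributed, the output can remain an RDD and have further transformations applied to it, which is why reduceByKey is a transformation.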
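To make the contrast concrete, here is a rough plain-Python model of the two operations. It does not use Spark: the `partitions` list, the function names, and the toy partitioner are all assumptions made for illustration. The point is only the shape of the result, a single value for the reduce-style function versus a partitioned collection for the reduceByKey-style one.

```python
from functools import reduce as fold

# Hypothetical stand-in for a pair RDD: a list of partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]

def simulated_reduce(parts, f):
    """Action-like: fold each partition, then merge the partials into ONE value."""
    partials = [fold(f, part) for part in parts if part]
    return fold(f, partials)

def simulated_reduce_by_key(parts, f, num_out=2):
    """Transformation-like: the result is again a list of partitions."""
    # Step 1: combine values per key inside each input partition.
    combined = []
    for part in parts:
        local = {}
        for k, v in part:
            local[k] = f(local[k], v) if k in local else v
        combined.append(local)
    # Step 2: "shuffle" each key to an output partition and merge there.
    out = [dict() for _ in range(num_out)]
    for local in combined:
        for k, v in local.items():
            target = out[sum(map(ord, k)) % num_out]  # deterministic toy partitioner
            target[k] = f(target[k], v) if k in target else v
    return [sorted(p.items()) for p in out]

add = lambda x, y: x + y
values_only = [[v for _, v in part] for part in partitions]

total = simulated_reduce(values_only, add)        # a single number
per_key = simulated_reduce_by_key(partitions, add)  # still partitioned
print(total)
print(per_key)
```

Because `per_key` keeps the same partitioned shape as the input, it could be fed into another step of the same kind, whereas `total` is just a value with nothing left to distribute.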

Note, however, that there is also a reduceByKeyLocally, which reduces by key and then automatically pulls the result down to a single location as a Map.
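As a sketch of what that means for the result type, the following plain-Python model (again not Spark itself, with a hypothetical `partitions` list and function name) merges per-key values within each partition and then collects everything into one ordinary dictionary, so nothing distributed is left to transform afterwards.

```python
# Hypothetical stand-in for a pair RDD: a list of partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
]

def simulated_reduce_by_key_locally(parts, f):
    """Reduce per key inside each partition, then merge into one local dict."""
    merged = {}
    for part in parts:
        local = {}
        for k, v in part:
            local[k] = f(local[k], v) if k in local else v
        # Partial results from this partition are merged at the single location.
        for k, v in local.items():
            merged[k] = f(merged[k], v) if k in merged else v
    return merged

result = simulated_reduce_by_key_locally(partitions, lambda x, y: x + y)
print(result)
```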

Justin Pihony answered Oct 12 '22 13:10