What is the difference between reduce and reduceByKey in Apache Spark in terms of functionality? Why is reduceByKey a transformation while reduce is an action?

This is close to a duplicate of my answer explaining <code>reduceByKey</code>, but I will elaborate on the specific part that makes the two different. Refer to that answer for more detail on the internals of <code>reduceByKey</code>.

Basically, <code>reduce</code> must bring its partial results together in a single location (the driver) because it reduces the whole dataset to one final value, which is why it is an action. <code>reduceByKey</code>, on the other hand, produces one value for each key. Since that reduction can first be run locally on each machine, the result can remain an RDD and have further transformations applied to it, which is why it is a transformation.

Note, however, that there is also a <code>reduceByKeyLocally</code>, which you can use to pull the result down to a single location as a Map.
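To make the distinction concrete, here is a minimal sketch in plain Python that models the two operations. It is not Spark code and not how Spark implements them: the `partitions` list, the sample data, and the `model_reduce` and `model_reduce_by_key` helpers are all hypothetical names introduced only for illustration, with each inner list standing in for the data held on one machine.

```python
from functools import reduce
from operator import add

# Hypothetical data: three "partitions", each standing in for one machine's share of an RDD.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]


def model_reduce(parts, f):
    """Model of rdd.reduce(f): reduce each partition locally,
    then combine the partial results in one place (the driver)."""
    partials = [reduce(f, part) for part in parts if part]
    return reduce(f, partials)  # a single plain value, no longer distributed


def model_reduce_by_key(parts, f):
    """Model of rdd.reduceByKey(f): combine values per key inside each
    partition first, then merge the per-partition results key by key."""
    local_maps = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = f(local[key], value) if key in local else value
        local_maps.append(local)

    merged = {}
    for local in local_maps:
        for key, value in local.items():
            merged[key] = f(merged[key], value) if key in merged else value
    return merged  # one value per key; in Spark this would still be an RDD


# reduce works on plain values, so strip the keys for this part of the example.
total = model_reduce([[v for _, v in part] for part in partitions], add)
per_key = model_reduce_by_key(partitions, add)

print(total)    # 21
print(per_key)  # {'a': 10, 'b': 7, 'c': 4}
```

In the model, `model_reduce` ends with exactly one value, so there is nothing left to distribute. `model_reduce_by_key` ends with a collection of key/value pairs, which is the kind of result Spark can keep partitioned across the cluster and chain further transformations onto.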

Difference between reduce and reduceByKey in Apache Spark


answered Oct 12 '22 13:10

Justin Pihony


Tags:

apache-spark

user1326784
