Is groupByKey ever preferred over reduceByKey?

I always use reduceByKey when I need to group data in RDDs, because it performs a map-side reduce before shuffling data, which often means that less data gets shuffled and I get better performance. Even when the map-side reduce function collects all values and does not actually reduce the amount of data, I still use reduceByKey, because I assume that reduceByKey will never perform worse than groupByKey. However, I'm wondering whether this assumption is correct, or whether there are situations where groupByKey should be preferred?

874

asked Oct 19 '15 18:10

Glennie Helles Sindholt

2 Answers

I believe there are other aspects of the problem ignored by climbage and eliasah:

code readability
code maintainability
codebase size

If an operation doesn't reduce the amount of data, it has to be, one way or another, semantically equivalent to groupByKey. Let's assume we have an RDD[(Int, String)]:

import scala.util.Random
Random.setSeed(1)

def randomString = Random.alphanumeric.take(Random.nextInt(10)).mkString("")

val rdd = sc.parallelize((1 to 20).map(_ => (Random.nextInt(5), randomString)))

and we want to concatenate all strings for a given key. With groupByKey it is pretty simple:

rdd.groupByKey.mapValues(_.mkString(""))

Naive solution with reduceByKey looks like this:

rdd.reduceByKey(_ + _)

It is short and arguably easy to understand but suffers from two issues:

it is extremely inefficient, since it creates a new String object on every merge*
it suggests that the operation you perform is less expensive than it really is, especially if you analyze only the DAG or the debug string
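To put a rough number on the first issue, here is a minimal sketch in plain Scala (no Spark required). The object and method names are made up for illustration, and the model is a simplification: it assumes values for a key are merged left to right and it counts only the characters written into each newly built String, ignoring JVM details such as buffer resizing.

```scala
object ConcatCost {
  // Characters written when merging left to right with `+`: each step
  // allocates a new String as long as both operands together.
  def copiedByPlus(parts: Seq[String]): Long = {
    var accLen = 0L // length of the running result
    var copied = 0L // total characters written into new Strings
    for ((s, i) <- parts.zipWithIndex) {
      accLen += s.length
      if (i > 0) copied += accLen // the first value needs no merge
    }
    copied
  }

  // Characters written when appending to a single StringBuilder:
  // each input character lands in the buffer once (resizing ignored).
  def copiedByBuilder(parts: Seq[String]): Long =
    parts.map(_.length.toLong).sum

  def main(args: Array[String]): Unit = {
    val parts = Seq.fill(1000)("abcde")
    println(s"with +            : ${copiedByPlus(parts)} characters written")
    println(s"with StringBuilder: ${copiedByBuilder(parts)} characters written")
  }
}
```

Under this model the cost of `+` grows quadratically with the number of values per key, while the builder stays linear.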

To deal with the first problem we need a mutable data structure:

import scala.collection.mutable.StringBuilder

rdd.combineByKey[StringBuilder](
    (s: String) => new StringBuilder(s),
    (sb: StringBuilder, s: String) => sb ++= s,
    (sb1: StringBuilder, sb2: StringBuilder) => sb1.append(sb2)
).mapValues(_.toString)

It still suggests something other than what is really going on, and it is quite verbose, especially if repeated multiple times in your script. You can of course extract the anonymous functions:

val createStringCombiner = (s: String) => new StringBuilder(s)
val mergeStringValue = (sb: StringBuilder, s: String) => sb ++= s
val mergeStringCombiners = (sb1: StringBuilder, sb2: StringBuilder) => 
  sb1.append(sb2)

rdd.combineByKey(createStringCombiner, mergeStringValue, mergeStringCombiners)

but at the end of the day it still means additional effort to understand the code, increased complexity and no real added value. One thing I find particularly troubling is the explicit inclusion of mutable data structures. Even if Spark handles almost all of the complexity, it means we no longer have elegant, referentially transparent code.

My point is: if you really reduce the amount of data, by all means use reduceByKey. Otherwise you make your code harder to write and harder to analyze, and gain nothing in return.
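For contrast, here is a sketch of a case where the data really does shrink: summing string lengths per key, so each partition can collapse its values to a single Int per key before the shuffle. The object name, helper names and sample data are hypothetical, and the local master is only an assumption for running the demo outside spark-shell (inside the shell, getOrCreate returns the existing context).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TotalLengthPerKey {
  // Map-side combine pays off: one Int per key and partition is shuffled.
  def viaReduce(rdd: RDD[(Int, String)]): RDD[(Int, Int)] =
    rdd.mapValues(_.length).reduceByKey(_ + _)

  // Same result, but every individual string crosses the shuffle first.
  def viaGroup(rdd: RDD[(Int, String)]): RDD[(Int, Int)] =
    rdd.groupByKey().mapValues(_.map(_.length).sum)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is good enough for this demo.
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("total-length-sketch"))

    val rdd = sc.parallelize(
      Seq((1, "a"), (1, "bcd"), (2, "ef"), (1, "g"), (2, "hij")))

    println(viaReduce(rdd).collect().sortBy(_._1).mkString(", "))
    println(viaGroup(rdd).collect().sortBy(_._1).mkString(", "))
  }
}
```

Both versions return the same totals; the reduceByKey version is the one that benefits from combining before the shuffle.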

Note:

This answer is focused on the Scala RDD API. The current Python implementation is quite different from its JVM counterpart and includes optimizations which provide a significant advantage over a naive reduceByKey implementation in the case of groupBy-like operations.

For the Dataset API, see DataFrame / Dataset groupBy behaviour/optimization.

* See Spark performance for Scala vs Python for a convincing example

137

answered Oct 05 '22 19:10

zero323

reduceByKey and groupByKey both use combineByKey with different combine/merge semantics.

The key difference I see is that groupByKey passes the flag mapSideCombine=false to the shuffle engine. Judging by the issue SPARK-772, this is a hint to the shuffle engine not to run the map-side combiner when the data size isn't going to change.
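If you do want to write a grouping by hand, the same flag is available on the combineByKey overload that takes a partitioner. The following is only a sketch of that idea, not a copy of Spark's internal groupByKey: the object name, the choice of ArrayBuffer as the combiner, the HashPartitioner and the sample data are all assumptions made for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

object GroupWithoutMapSideCombine {
  // Collects all values per key, skipping the map-side combiner because
  // it would not shrink the data anyway.
  def group(rdd: RDD[(Int, String)]): RDD[(Int, ArrayBuffer[String])] =
    rdd.combineByKey[ArrayBuffer[String]](
      (s: String) => ArrayBuffer(s),
      (buf: ArrayBuffer[String], s: String) => buf += s,
      (b1: ArrayBuffer[String], b2: ArrayBuffer[String]) => b1 ++= b2,
      new HashPartitioner(rdd.partitions.length),
      mapSideCombine = false
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local master is good enough for this demo.
    val sc = SparkContext.getOrCreate(
      new SparkConf().setMaster("local[2]").setAppName("map-side-combine-sketch"))

    val rdd = sc.parallelize(
      Seq((1, "a"), (2, "b"), (1, "c"), (2, "d"), (1, "e")), 2)

    group(rdd).collect().foreach { case (k, vs) =>
      println(s"$k -> ${vs.mkString(",")}")
    }
  }
}
```

The order of values within each group is not guaranteed, so the printed lines may differ between runs.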

So I would say that if you are trying to use reduceByKey to replicate groupByKey, you might see a slight performance hit.
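A toy model in plain Scala (no Spark involved) may help show why the map-side step buys nothing here. Partitions are modelled as plain sequences, the combiner simply concatenates strings per key, and all names are hypothetical. It is a sketch of the reasoning, not a measurement of Spark's shuffle.

```scala
object ShuffleSimulation {
  type Partition = Seq[(Int, String)]

  // What a group-like map-side combiner would do inside one partition:
  // merge all values of a key into one record, keeping every character.
  def mapSideCombine(part: Partition): Partition =
    part.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).mkString) }.toSeq

  // Number of records that would cross the shuffle.
  def records(parts: Seq[Partition]): Int = parts.map(_.size).sum

  // Number of payload characters that would cross the shuffle.
  def payloadChars(parts: Seq[Partition]): Int =
    parts.flatten.map(_._2.length).sum

  def main(args: Array[String]): Unit = {
    val parts: Seq[Partition] = Seq(
      Seq((1, "a"), (1, "b"), (2, "c")),
      Seq((1, "d"), (2, "e"), (2, "f"))
    )
    val combined = parts.map(mapSideCombine)
    println(s"records: ${records(parts)} before, ${records(combined)} after")
    println(s"payload: ${payloadChars(parts)} before, ${payloadChars(combined)} after")
  }
}
```

In this model the combiner lowers the record count but leaves the payload unchanged, so the extra work on the map side is spent without reducing what has to be moved.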

answered Oct 05 '22 19:10

Mike Park
