I am new to Spark and Scala and working on a simple word-count example.
For that I am using countByValue as follows:
val words = lines.flatMap(x => x.split("\\W+")).map(x => x.toLowerCase())
val wordCount = words.countByValue();
which works fine.
The same thing can be achieved like this:
val words = lines.flatMap(x => x.split("\\W+")).map(x => x.toLowerCase())
val wordCounts = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val sortedWords = wordCounts.map(x => (x._2, x._1)).sortByKey()
which also works fine.
Now, my question is: when should I use which method? Is one preferred over the other?
At least in PySpark, they are different things.

countByValue is implemented with reduce, which means that the driver collects the partial results from the partitions and does the merge itself. If your result is large, the driver will have to merge a large number of large dictionaries, which will put the driver under heavy pressure.

reduceByKey shuffles the keys to different executors and does the reduction on every worker, so it is more favorable when the data is large.

In conclusion, when your data is large, using map, reduceByKey and collect will make your driver much happier. If your data is small, countByValue will introduce less network traffic (one stage less).
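To make the advice above concrete, here is a minimal sketch for the large-data case, reusing the `words` RDD from the question. It keeps the reduction on the executors and avoids pulling the full result into the driver (the output path is illustrative, not from the original post):

```scala
// Distributed reduction: the per-word sums are computed on the executors.
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

// Inspect only a small sample in the driver instead of collecting everything...
wordCounts.take(10).foreach(println)

// ...or write the full result out from the executors, bypassing the driver
// entirely (the path here is just a placeholder).
wordCounts.saveAsTextFile("hdfs:///tmp/word-counts")
```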
Here is an example - with numbers rather than words:
val n = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
val n2 = n.countByValue
returns a local Map:
n: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at command-3737881976236428:1
n2: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 -> 2)
That is the key difference.
If you want a Map out of the box, then this is the way to go.
Also note that with countByValue the reduce is implied and cannot be influenced, nor need it be provided, as it must be with reduceByKey.
reduceByKey is preferred when data sizes are large, because the Map returned by countByValue is loaded into driver memory in its entirety.
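That implied reduce also means countByValue can only ever count, whereas reduceByKey lets you choose the combining function. A short sketch reusing the `n` RDD from the example above (the modulo keying is purely illustrative):

```scala
// Key each number by its remainder mod 3, then reduce per key on the executors.
val keyed = n.map(x => (x % 3, x))

// Counting: the fixed behaviour of countByValue, spelled out explicitly.
val countsPerKey = keyed.mapValues(_ => 1L).reduceByKey(_ + _)

// Any other associative, commutative function works just as well,
// e.g. the maximum value per key - something countByValue cannot express.
val maxPerKey = keyed.reduceByKey(math.max)
```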
Adding to all the above answers, here is what I found further:
countByValue returns a Map, which cannot be used in a distributed manner.
reduceByKey returns an RDD, which can be processed further in a distributed manner.
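For example, the RDD returned by reduceByKey can feed further distributed transformations directly, while the local Map from countByValue would first have to be shipped back to the cluster. A minimal sketch, assuming the `words` RDD from the question (the threshold of 100 is illustrative):

```scala
// reduceByKey keeps the data on the cluster, so later steps stay distributed.
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
val frequent = wordCounts.filter { case (_, count) => count > 100 }

// countByValue returns a local scala.collection.Map in the driver; to use it
// in distributed operations again, it must be re-parallelized explicitly.
val localCounts = words.countByValue()
val countsRdd = sc.parallelize(localCounts.toSeq)
```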