
When to use countByValue and when to use map().reduceByKey()

I am new to Spark and Scala and am working on a simple word-count example.

For that I am using countByValue as follows:

val words = lines.flatMap(x => x.split("\\W+")).map(x => x.toLowerCase())
val wordCount = words.countByValue();

which works fine.

The same thing can be achieved like this:

val words = lines.flatMap(x => x.split("\\W+")).map(x => x.toLowerCase())
val wordCounts = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val sortedWords = wordCounts.map(x => (x._2, x._1)).sortByKey()

which also works fine.

Now, my question is: when should I use which method? Is one preferred over the other?

asked Oct 21 '18 by KayV


4 Answers

At least in PySpark, they are different things.

countByKey (and likewise countByValue, which builds on it) is implemented with a reduce, which means that the driver collects the partial results from the partitions and does the merge itself. If the result is large, the driver has to merge a large number of large dictionaries, which will overwhelm it.

reduceByKey shuffles the keys to different executors and does the reduction on each worker, so it is more favorable when the data is large.

In conclusion: when your data is large, map + reduceByKey + collect will make your driver much happier. If your data is small, countByKey (or countByValue) introduces less network traffic (one fewer stage).
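The two execution strategies can be sketched without Spark, using plain Scala collections. This is a hypothetical simulation, not Spark's actual implementation: `driverSideMerge` mimics the countByValue path (every partition's partial map is merged in one place, the "driver"), while `shuffleAndReduce` mimics reduceByKey (pairs are hashed to reducers, and each reducer only ever holds its own keys):

```scala
object AggregationSketch {
  // countByValue-style: each partition builds a local map of counts;
  // the driver then merges all of the partial maps itself.
  def driverSideMerge(partitions: Seq[Seq[String]]): Map[String, Long] = {
    val partials = partitions.map(_.groupBy(identity).map { case (w, ws) => w -> ws.size.toLong })
    partials.foldLeft(Map.empty[String, Long]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (w, n)) => a.updated(w, a.getOrElse(w, 0L) + n) }
    }
  }

  // reduceByKey-style: pairs are hashed to "reducers"; each reducer sums
  // only its own keys, so no single node holds the full result map.
  def shuffleAndReduce(partitions: Seq[Seq[String]], numReducers: Int): Map[String, Long] = {
    val pairs = partitions.flatten.map(w => (w, 1L))
    val byReducer = pairs.groupBy { case (w, _) => math.abs(w.hashCode) % numReducers }
    byReducer.values.flatMap { bucket =>
      bucket.groupBy(_._1).map { case (w, ps) => w -> ps.map(_._2).sum }
    }.toMap
  }
}
```

Both strategies produce the same counts; the difference is where the full result map materializes.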

answered Oct 17 '22 by DarkZero


Here is an example with numbers rather than words:

val n = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
val n2 = n.countByValue

returns a local Map:

n: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at command-3737881976236428:1
n2: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 -> 2)

That is the key difference.

If you want a Map out of the box, then this is the way to go.

Also, the point is that the reduce is implied and cannot be influenced, nor does it need to be provided, as it does with reduceByKey.

reduceByKey is preferable when data sizes are large, because with countByValue the Map is loaded into driver memory in its entirety.
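That implied reduce is fixed: countByValue can only ever count. reduceByKey, by contrast, accepts any associative combine function. A small sketch in plain Scala collections (simulating the per-key reduce, not actual Spark code), here keeping the longest word per first letter instead of a count:

```scala
val words = List("spark", "scala", "shuffle", "reduce", "rdd")
// analogous to words.map(w => (w.head, w)).reduceByKey((a, b) => if (b.length > a.length) b else a)
val longestPerLetter = words
  .map(w => (w.head, w))
  .groupBy(_._1)
  .map { case (letter, pairs) =>
    letter -> pairs.map(_._2).reduceLeft((a, b) => if (b.length > a.length) b else a)
  }
// longestPerLetter == Map('s' -> "shuffle", 'r' -> "reduce")
```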

answered Oct 17 '22 by thebluephantom


Adding to all the answers above, here is what I found further:

  1. countByValue returns a local Map, which cannot be used in a distributed manner.

  2. reduceByKey returns an RDD, which can be processed further in a distributed manner.
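In other words, the Map returned by countByValue is a terminal, driver-local value, while the RDD from reduceByKey can keep flowing through transformations, like the swap-and-sort step in the question. A plain-Scala simulation of that chaining (hypothetical, not the Spark API):

```scala
// counts as reduceByKey would leave them: still (word, count) pairs, not a finished Map
val wordCounts = List(("the", 3), ("spark", 2), ("word", 1))
// mirrors wordCounts.map(x => (x._2, x._1)).sortByKey() from the question
val sortedWords = wordCounts.map { case (w, n) => (n, w) }.sortBy(_._1)
// sortedWords == List((1, "word"), (2, "spark"), (3, "the"))
```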

answered Oct 16 '22 by KayV


  1. countByValue() is an RDD action that returns the count of each unique value in the RDD as a local Map of (value, count) pairs.

  2. reduceByKey() is an RDD transformation that returns an RDD of (key, value) pairs.
answered Oct 16 '22 by user3282611