I am new to Spark and Scala and working on a simple word-count example.
For that I am using countByValue as follows:
val words = lines.flatMap(x => x.split("\\W+")).map(x => x.toLowerCase())
val wordCount = words.countByValue();
which works fine.
The same thing can be achieved like this:
val words = lines.flatMap(x => x.split("\\W+")).map(x => x.toLowerCase())
val wordCounts = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
val sortedWords = wordCounts.map(x => (x._2, x._1)).sortByKey()
which also works fine.
Now, my question is: when should I use which method? Is one preferred over the other?
At least in PySpark, they are different things.

countByValue is implemented with reduce, which means that the driver collects the partial results from the partitions and does the merge itself. If your result is large, the driver will have to merge a large number of large dictionaries, which will put the driver under heavy pressure.

reduceByKey shuffles the keys to different executors and does the reduction on every worker, so it is more favorable when the data is large.

In conclusion, when your data is large, using map, reduceByKey and collect will make your driver much happier. If your data is small, countByValue will introduce less network traffic (one stage less).
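To make the advice above concrete, here is a minimal sketch for the large-data case, reusing the `words` RDD from the question. It keeps the reduction on the executors and avoids pulling the full result into the driver (the output path is illustrative, not from the original post):

```scala
// Distributed reduction: the per-word sums are computed on the executors.
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

// Inspect only a small sample in the driver instead of collecting everything...
wordCounts.take(10).foreach(println)

// ...or write the full result out from the executors, bypassing the driver
// entirely (the path here is just a placeholder).
wordCounts.saveAsTextFile("hdfs:///tmp/word-counts")
```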
Here is an example - with numbers rather than words:
val n = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
val n2 = n.countByValue
returns a local Map:
n: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at command-3737881976236428:1
n2: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 -> 2)
That is the key difference.
If you want a Map out of the box, then this is the way to go.
Also note that with countByValue the reduce is implied and cannot be influenced, nor need it be provided, as it must be with reduceByKey.
reduceByKey is preferred when data sizes are large, because the Map returned by countByValue is loaded into driver memory in its entirety.
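That implied reduce also means countByValue can only ever count, whereas reduceByKey lets you choose the combining function. A short sketch reusing the `n` RDD from the example above (the modulo keying is purely illustrative):

```scala
// Key each number by its remainder mod 3, then reduce per key on the executors.
val keyed = n.map(x => (x % 3, x))

// Counting: the fixed behaviour of countByValue, spelled out explicitly.
val countsPerKey = keyed.mapValues(_ => 1L).reduceByKey(_ + _)

// Any other associative, commutative function works just as well,
// e.g. the maximum value per key - something countByValue cannot express.
val maxPerKey = keyed.reduceByKey(math.max)
```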
Adding to all the above answers, here is what I found further:
countByValue returns a Map, which cannot be used in a distributed manner.
reduceByKey returns an RDD, which can be processed further in a distributed manner.
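For example, the RDD returned by reduceByKey can feed further distributed transformations directly, while the local Map from countByValue would first have to be shipped back to the cluster. A minimal sketch, assuming the `words` RDD from the question (the threshold of 100 is illustrative):

```scala
// reduceByKey keeps the data on the cluster, so later steps stay distributed.
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
val frequent = wordCounts.filter { case (_, count) => count > 100 }

// countByValue returns a local scala.collection.Map in the driver; to use it
// in distributed operations again, it must be re-parallelized explicitly.
val localCounts = words.countByValue()
val countsRdd = sc.parallelize(localCounts.toSeq)
```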