 

Efficiency of flatMap vs map followed by reduce in Spark

I have a text file sherlock.txt containing multiple lines of text. I load it in spark-shell using:

val textFile = sc.textFile("sherlock.txt")

My goal is to count the number of words in the file. I came across two alternative ways to do the job.

First using flatMap:

textFile.flatMap(line => line.split(" ")).count()

Second using map followed by reduce:

textFile.map(line => line.split(" ").size).reduce((a, b) => a + b)

Both correctly yield the same result. What are the differences in time and space complexity between these two implementations, if indeed there are any?

Does the Scala interpreter convert both into the most efficient form?

Sumandeep Banerjee asked Mar 30 '16

People also ask

What is the difference between flatMap and map in spark?

The map() operation applies a function to each element of an RDD and returns the result as a new RDD. In map, the developer can define custom business logic. flatMap() is similar to map, but flatMap allows returning 0, 1, or more elements from the mapping function.
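For instance, a minimal spark-shell sketch (the two-line RDD here is hypothetical, built with sc.parallelize purely for illustration):

val lines = sc.parallelize(Seq("to be or", "not to be"))

// map: exactly one output element per input element (an Array per line)
lines.map(_.split(" ")).collect()
// => Array(Array(to, be, or), Array(not, to, be))

// flatMap: each line may yield zero or more elements, flattened into one RDD
lines.flatMap(_.split(" ")).collect()
// => Array(to, be, or, not, to, be)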

What is the difference between flatMap and map in Scala?

The flatMap() method is similar to the map() method, but in flatMap the inner grouping of items is removed and a flat sequence is generated. flatMap acts as shorthand for mapping a collection and then immediately flattening the result.
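The same distinction holds for plain Scala collections; a small sketch:

val phrases = List("to be", "or not")

phrases.map(_.split(" ").toList)      // List(List(to, be), List(or, not))
phrases.flatMap(_.split(" ").toList)  // List(to, be, or, not)

// flatMap is shorthand for map followed by flatten:
phrases.map(_.split(" ").toList).flatten  // List(to, be, or, not)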

How does flatMap work in spark?

flatMap() on a Spark DataFrame (i.e., a Dataset) operates much as it does on an RDD: when applied, it executes the specified function on every element, splitting or merging elements as it goes, so the number of rows in the result of flatMap() can differ from the number of rows in the input.
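A minimal sketch on a Dataset[String] (assuming the spark-shell SparkSession named spark, with spark.implicits._ in scope):

import spark.implicits._

val ds = Seq("to be or", "not to be").toDS()
ds.count()                        // 2 rows in
ds.flatMap(_.split(" ")).count()  // 6 rows out: the result count differs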

What is reduce in spark?

reduce is a Spark action that aggregates the elements of a dataset (RDD) using a function. That function takes two arguments and returns one; it should be commutative and associative so that it can be computed correctly in parallel. reduce returns a single value, such as an Int.
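A small sketch in spark-shell (addition is both commutative and associative, so the result does not depend on how the data is partitioned):

val nums = sc.parallelize(1 to 10)
nums.reduce((a, b) => a + b)  // 55, a single Int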


1 Answer

I will argue that the most idiomatic way to handle this would be to map and sum:

textFile.map(_.split(" ").size).sum

but at the end of the day the total cost will be dominated by line.split(" ").

You could probably do a little better by iterating over the string manually and counting words as you go instead of building a new Array, but I doubt it is worth the fuss in general.
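For the curious, a rough sketch of such a manual count (a hypothetical helper, not benchmarked; note its semantics differ slightly from split(" ").size, which also counts empty tokens produced by consecutive spaces):

// Count words in a line without allocating an Array of the pieces.
// A word here is a maximal run of non-space characters.
def countWords(line: String): Int = {
  var count = 0
  var inWord = false
  var i = 0
  while (i < line.length) {
    val isSpace = line.charAt(i) == ' '
    if (!isSpace && !inWord) count += 1  // entering a new word
    inWord = !isSpace
    i += 1
  }
  count
}

textFile.map(countWords).sum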

If you want a slightly deeper insight, count is defined as:

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum  

where Utils.getIteratorSize is pretty much a naive iteration over the Iterator, summing ones, and sum is equivalent to

_.fold(0.0)(_ + _)
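That naive iteration over the Iterator is essentially of this shape (a paraphrase for illustration, not a verbatim copy of Spark's source):

// Walk the iterator once, discarding elements and counting them.
def getIteratorSize[T](iterator: Iterator[T]): Long = {
  var count = 0L
  while (iterator.hasNext) {
    iterator.next()
    count += 1L
  }
  count
}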
2 revs answered Sep 20 '22