I have a text file sherlock.txt containing multiple lines of text. I load it in spark-shell using:
val textFile = sc.textFile("sherlock.txt")
My purpose is to count the number of words in the file. I came across two alternative ways to do the job.
First using flatMap:
textFile.flatMap(line => line.split(" ")).count()
Second using map followed by reduce:
textFile.map(line => line.split(" ").size).reduce((a, b) => a + b)
Both yield the same result correctly. I want to know the differences in time and space complexity of the above two alternative implementations, if indeed there are any.
Does the Scala interpreter convert both into the most efficient form?
The map() operation applies a function to each element of an RDD and returns the result as a new RDD. In map, the developer can define custom business logic. flatMap() is similar to map, but it allows returning 0, 1, or more elements from the function.
The flatMap() method is similar to the map() method; the difference is that flatMap removes the inner grouping of each item and generates a flat sequence. flatMap acts as a shorthand for mapping a collection and then immediately flattening it.
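As a minimal spark-shell sketch of that difference (the sample lines are made up for illustration):
val lines = sc.parallelize(Seq("to be or", "not to be"))
// map keeps one output element per input element: RDD[Array[String]]
lines.map(_.split(" ")).count()      // 2 -- one array per line
// flatMap flattens the inner arrays away: RDD[String]
lines.flatMap(_.split(" ")).count()  // 6 -- one element per word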
flatMap() on a Spark DataFrame behaves as it does on an RDD: it applies the specified function to every element, splitting or merging elements as it goes, so the element count after the flatMap() transformation can differ from the input count.
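For example, a hedged sketch of that on a typed Dataset (the toy data here is assumed, not from the question):
import spark.implicits._
val df = Seq("Sherlock Holmes", "A Study in Scarlet").toDF("line")
// flatMap needs a typed view of the rows; each row can yield zero or more
// output rows, so words.count() (6) differs from df.count() (2)
val words = df.as[String].flatMap(_.split(" "))
words.count()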
reduce is a Spark action that aggregates the elements of a dataset (RDD) using a function. That function takes two arguments and returns one; it must be commutative and associative so that it can be computed correctly in parallel. reduce returns a single value, such as an Int.
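A minimal sketch of reduce in this setting, mirroring the second approach from the question:
// map each line to its word count, then reduce with a commutative,
// associative function to get a single Int back on the driver
val lineCounts = sc.textFile("sherlock.txt").map(_.split(" ").length)
lineCounts.reduce((a, b) => a + b)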
I would argue that the most idiomatic way to handle this would be to map and sum:
textFile.map(_.split(" ").size).sum
but at the end of the day the total cost will be dominated by line.split(" ").
You could probably do a little bit better by iterating over the string manually and counting consecutive whitespaces instead of building a new Array (a sketch follows below), but I doubt it is worth all the fuss in general.
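For illustration only, a sketch of that manual approach (countWords is a hypothetical helper; note its handling of repeated spaces differs slightly from split(" "), which emits empty tokens for them):
// Count words by scanning characters instead of allocating an Array per line;
// runs of consecutive spaces are treated as a single separator.
def countWords(line: String): Int = {
  var count = 0
  var inWord = false
  for (c <- line) {
    if (c == ' ') inWord = false
    else if (!inWord) { inWord = true; count += 1 }
  }
  count
}
textFile.map(countWords).sum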
If you prefer a little bit deeper insight, count is defined as:
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
where Utils.getIteratorSize is pretty much a naive iteration over the Iterator with a sum of ones, and sum is equivalent to:
_.fold(0.0)(_ + _)