I have a text file sherlock.txt containing multiple lines of text. I load it in spark-shell using:
val textFile = sc.textFile("sherlock.txt")
My purpose is to count the number of words in the file. I came across two alternative ways to do the job.
First using flatMap:
textFile.flatMap(line => line.split(" ")).count()
Second using map followed by reduce:
textFile.map(line => line.split(" ").size).reduce((a, b) => a + b)
Both yield the same result correctly. I want to know the differences in time and space complexity of the above two alternative implementations, if indeed there are any.
Does the Scala interpreter convert both into the most efficient form?
The map() operation applies a function to each element of an RDD and returns the result as a new RDD. In map, the developer can define custom business logic. flatMap() is similar to map, but it allows returning 0, 1, or more elements from the function.
The flatMap() method is similar to the map() method; the difference is that flatMap removes the inner grouping of each item and generates a flat sequence. flatMap acts as a shorthand for mapping a collection and then immediately flattening it.
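As a minimal spark-shell sketch of that difference (the sample lines are made up for illustration):
val lines = sc.parallelize(Seq("to be or", "not to be"))
// map keeps one output element per input element: RDD[Array[String]]
lines.map(_.split(" ")).count()      // 2 -- one array per line
// flatMap flattens the inner arrays away: RDD[String]
lines.flatMap(_.split(" ")).count()  // 6 -- one element per word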
flatMap() on a Spark DataFrame behaves as it does on an RDD: it applies the specified function to every element, splitting or merging elements as it goes, so the element count after the flatMap() transformation can differ from the input count.
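For example, a hedged sketch of that on a typed Dataset (the toy data here is assumed, not from the question):
import spark.implicits._
val df = Seq("Sherlock Holmes", "A Study in Scarlet").toDF("line")
// flatMap needs a typed view of the rows; each row can yield zero or more
// output rows, so words.count() (6) differs from df.count() (2)
val words = df.as[String].flatMap(_.split(" "))
words.count()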
reduce is a Spark action that aggregates the elements of a dataset (RDD) using a function. That function takes two arguments and returns one; it must be commutative and associative so that it can be computed correctly in parallel. reduce returns a single value, such as an Int.
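A minimal sketch of reduce in this setting, mirroring the second approach from the question:
// map each line to its word count, then reduce with a commutative,
// associative function to get a single Int back on the driver
val lineCounts = sc.textFile("sherlock.txt").map(_.split(" ").length)
lineCounts.reduce((a, b) => a + b)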
I would argue that the most idiomatic way to handle this would be to map and sum:
textFile.map(_.split(" ").size).sum
but at the end of the day the total cost will be dominated by line.split(" ").
You could probably do a little bit better by iterating over the string manually and counting consecutive whitespaces instead of building a new Array (a sketch follows below), but I doubt it is worth all the fuss in general.
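For illustration only, a sketch of that manual approach (countWords is a hypothetical helper; note its handling of repeated spaces differs slightly from split(" "), which emits empty tokens for them):
// Count words by scanning characters instead of allocating an Array per line;
// runs of consecutive spaces are treated as a single separator.
def countWords(line: String): Int = {
  var count = 0
  var inWord = false
  for (c <- line) {
    if (c == ' ') inWord = false
    else if (!inWord) { inWord = true; count += 1 }
  }
  count
}
textFile.map(countWords).sum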
If you prefer a little bit deeper insight, count is defined as:
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
where Utils.getIteratorSize is pretty much a naive iteration over the Iterator with a sum of ones, and sum is equivalent to:
_.fold(0.0)(_ + _)