Can someone explain to me the difference between map and flatMap and what is a good use case for each? What does "flatten the results" mean? What is it good for?


What is the difference between map and flatMap and a good use case for each?

2 Answers

Here is an example of the difference, as a spark-shell session:

First, some data - two lines of text:

    val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue"))  // lines
    rdd.collect

    res0: Array[String] = Array("Roses are red", "Violets are blue")

Now, map transforms an RDD of length N into another RDD of length N.

For example, it maps from two lines into two line-lengths:

    rdd.map(_.length).collect

    res1: Array[Int] = Array(13, 16)

But flatMap (loosely speaking) transforms an RDD of length N into a collection of N collections, then flattens these into a single RDD of results.

    rdd.flatMap(_.split(" ")).collect

    res2: Array[String] = Array("Roses", "are", "red", "Violets", "are", "blue")

We have multiple words per line, and multiple lines, but we end up with a single output array of words.

Just to illustrate that, flatMapping from a collection of lines to a collection of words looks like:

["aa bb cc", "", "dd"] => [["aa","bb","cc"],[],["dd"]] => ["aa","bb","cc","dd"]

The input and output RDDs will therefore typically be of different sizes for flatMap.
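The two-step picture above (map each line to a collection, then flatten) can be sketched without Spark at all. This is only an illustration in plain Python, with ordinary lists standing in for RDDs; `flat_map` is a hypothetical helper, not part of any Spark API. It splits on whitespace with `str.split()` (no argument) so that the empty line yields an empty list, as in the illustration.

```python
from itertools import chain

def flat_map(func, items):
    """Apply func to every item, then concatenate the resulting iterables."""
    return list(chain.from_iterable(func(item) for item in items))

lines = ["aa bb cc", "", "dd"]

# Step 1: mapping alone keeps one (possibly empty) list per input line.
nested = [line.split() for line in lines]

# Step 2: flattening joins those lists end to end into one list.
words = flat_map(str.split, lines)

print(nested)  # [['aa', 'bb', 'cc'], [], ['dd']]
print(words)   # ['aa', 'bb', 'cc', 'dd']
```

Note that `nested` has the same length as `lines` (3), while `words` has 4 elements: the size only changes once the results are flattened.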

If we had tried to use map with our split function, we'd have ended up with nested structures (an RDD of arrays of words, with type RDD[Array[String]]) because we have to have exactly one result per input:

    rdd.map(_.split(" ")).collect

    res3: Array[Array[String]] = Array(
      Array(Roses, are, red),
      Array(Violets, are, blue)
    )

Finally, one useful special case is mapping with a function which might not return an answer, and so returns an Option. We can use flatMap to filter out the elements that return None and extract the values from those that return a Some:

    val rdd = sc.parallelize(Seq(1, 2, 3, 4))

    def myfn(x: Int): Option[Int] = if (x <= 2) Some(x * 10) else None

    rdd.flatMap(myfn).collect

    res4: Array[Int] = Array(10, 20)

(Note that an Option behaves rather like a list that has either zero elements or one element.)
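The same filter-and-extract idea can be sketched in plain Python, which has no built-in Option type. As an assumption for illustration only, a one-element list plays the role of Some and an empty list plays the role of None; `my_fn` is a hypothetical stand-in for `myfn` above.

```python
from itertools import chain

def my_fn(x):
    """Stand-in for myfn: [value] plays Some(value), [] plays None."""
    return [x * 10] if x <= 2 else []

numbers = [1, 2, 3, 4]

# Flattening drops the empty lists and unwraps the one-element lists,
# so this filters and transforms in a single pass.
results = list(chain.from_iterable(my_fn(n) for n in numbers))

print(results)  # [10, 20]
```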

answered Oct 07 '22 19:10

DNA

Word count is the usual example in Hadoop. I will take the same use case, apply both map and flatMap to it, and look at how each one processes the data.

Below is the sample data file.

    hadoop is fast
    hive is sql on hdfs
    spark is superfast
    spark is awesome

The above file will be parsed using map and flatMap.

Using `map`

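    >>> wc = data.map(lambda line: line.split(" "))
    >>> wc.collect()
    [[u'hadoop', u'is', u'fast'],
     [u'hive', u'is', u'sql', u'on', u'hdfs'],
     [u'spark', u'is', u'superfast'],
     [u'spark', u'is', u'awesome']]

The input has 4 lines and the output has 4 elements as well, i.e., N elements ==> N elements. Each output element is the list of words from one line.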

Using `flatMap`

    >>> fm = data.flatMap(lambda line: line.split(" "))
    >>> fm.collect()
    [u'hadoop', u'is', u'fast', u'hive', u'is', u'sql', u'on', u'hdfs',
     u'spark', u'is', u'superfast', u'spark', u'is', u'awesome']

Here the output is one flat list of 14 words, not one element per input line.

To get the word count, let's pair each word (the key) with the value 1.

fm: RDD created by using flatMap
wc: RDD created using map

    >>> fm.map(lambda word: (word, 1)).collect()
    [(u'hadoop', 1), (u'is', 1), (u'fast', 1), (u'hive', 1), (u'is', 1),
     (u'sql', 1), (u'on', 1), (u'hdfs', 1), (u'spark', 1), (u'is', 1),
     (u'superfast', 1), (u'spark', 1), (u'is', 1), (u'awesome', 1)]

Whereas calling flatMap on the RDD wc gives the undesired output below, because each element of wc is a whole list of words and each returned tuple is itself flattened into two separate elements:

    >>> wc.flatMap(lambda word: (word, 1)).collect()
    [[u'hadoop', u'is', u'fast'], 1,
     [u'hive', u'is', u'sql', u'on', u'hdfs'], 1,
     [u'spark', u'is', u'superfast'], 1,
     [u'spark', u'is', u'awesome'], 1]
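Why the output has that shape can be sketched in plain Python, again with lists standing in for RDDs (an assumption for illustration, not PySpark itself). Flattening iterates over whatever the function returns, and a 2-tuple is just an iterable of two members, so each tuple is taken apart.

```python
from itertools import chain

# What wc holds after map: one list of words per line (first two lines only).
wc = [["hadoop", "is", "fast"], ["hive", "is", "sql", "on", "hdfs"]]

# The function returns a 2-tuple; flattening iterates over that tuple
# and emits its two members as separate output elements.
flattened = list(chain.from_iterable((words, 1) for words in wc))

print(flattened)
# [['hadoop', 'is', 'fast'], 1, ['hive', 'is', 'sql', 'on', 'hdfs'], 1]
```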

So you can't get the word count this way if the lines were split with map instead of flatMap.
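For completeness, here is a minimal sketch of the whole word count in plain Python, with the sample file's four lines hard-coded as a list (an assumption, since the answer does not show how `data` was loaded). The final summing step is done with a `Counter`; in PySpark that step would typically be `reduceByKey` on the `(word, 1)` pairs.

```python
from collections import Counter
from itertools import chain

data = [
    "hadoop is fast",
    "hive is sql on hdfs",
    "spark is superfast",
    "spark is awesome",
]

# flatMap step: split every line and flatten into one list of words.
words = list(chain.from_iterable(line.split(" ") for line in data))

# map step: pair each word with the value 1.
pairs = [(word, 1) for word in words]

# Reduce-by-key step: add up the 1s for each distinct word.
counts = Counter()
for word, one in pairs:
    counts[word] += one

print(counts["is"], counts["spark"], counts["hadoop"])  # 4 2 1
```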

By definition, the difference between map and flatMap is:

map: Returns a new RDD by applying the given function to each element of the RDD. The function returns exactly one item per input element.

flatMap: Like map, returns a new RDD by applying a function to each element of the RDD, but the function returns a collection of zero or more items per element and the output is flattened.

answered Oct 07 '22 20:10

yoga


Tags:

apache-spark

Eran Witkon
