Can someone explain to me the difference between map and flatMap and what is a good use case for each?
What does "flatten the results" mean? What is it good for?
Both of the functions map() and flatMap are used for transformation and mapping operations. map() function produces one output for one input value, whereas flatMap() function produces an arbitrary no of values as output (ie zero or more than zero) for each input value. Where R is the element type of the new stream.
Furthermore, map() and flatMap() can be distinguished in a way that map() generates a single value against an input while flatMap() generates zero or any number values against an input. In other words, map() is used to transform the data while the flatMap() is used to transform and flatten the stream.
flatMap , as it can be guessed by its name, is the combination of a map and a flat operation. That means that you first apply a function to your elements, and then flatten it. Stream. map only applies a function to the stream without flattening the stream.
It uses One-To-One mapping. It's mapper function produces multiple values (stream of values) for each input value. It's mapper function produces single values for each input value. Use the flatMap() method when the mapper function is producing multiple values for each input value.
Here is an example of the difference, as a spark-shell
session:
First, some data - two lines of text:
val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue")) // lines rdd.collect res0: Array[String] = Array("Roses are red", "Violets are blue")
Now, map
transforms an RDD of length N into another RDD of length N.
For example, it maps from two lines into two line-lengths:
rdd.map(_.length).collect res1: Array[Int] = Array(13, 16)
But flatMap
(loosely speaking) transforms an RDD of length N into a collection of N collections, then flattens these into a single RDD of results.
rdd.flatMap(_.split(" ")).collect res2: Array[String] = Array("Roses", "are", "red", "Violets", "are", "blue")
We have multiple words per line, and multiple lines, but we end up with a single output array of words
Just to illustrate that, flatMapping from a collection of lines to a collection of words looks like:
["aa bb cc", "", "dd"] => [["aa","bb","cc"],[],["dd"]] => ["aa","bb","cc","dd"]
The input and output RDDs will therefore typically be of different sizes for flatMap
.
If we had tried to use map
with our split
function, we'd have ended up with nested structures (an RDD of arrays of words, with type RDD[Array[String]]
) because we have to have exactly one result per input:
rdd.map(_.split(" ")).collect res3: Array[Array[String]] = Array( Array(Roses, are, red), Array(Violets, are, blue) )
Finally, one useful special case is mapping with a function which might not return an answer, and so returns an Option
. We can use flatMap
to filter out the elements that return None
and extract the values from those that return a Some
:
val rdd = sc.parallelize(Seq(1,2,3,4)) def myfn(x: Int): Option[Int] = if (x <= 2) Some(x * 10) else None rdd.flatMap(myfn).collect res3: Array[Int] = Array(10,20)
(noting here that an Option behaves rather like a list that has either one element, or zero elements)
Generally we use word count example in hadoop. I will take the same use case and will use map
and flatMap
and we will see the difference how it is processing the data.
Below is the sample data file.
hadoop is fast hive is sql on hdfs spark is superfast spark is awesome
The above file will be parsed using map
and flatMap
.
map
>>> wc = data.map(lambda line:line.split(" ")); >>> wc.collect() [u'hadoop is fast', u'hive is sql on hdfs', u'spark is superfast', u'spark is awesome']
Input has 4 lines and output size is 4 as well, i.e., N elements ==> N elements.
flatMap
>>> fm = data.flatMap(lambda line:line.split(" ")); >>> fm.collect() [u'hadoop', u'is', u'fast', u'hive', u'is', u'sql', u'on', u'hdfs', u'spark', u'is', u'superfast', u'spark', u'is', u'awesome']
The output is different from map.
Let's assign 1 as value for each key to get the word count.
fm
: RDD created by using flatMap
wc
: RDD created using map
>>> fm.map(lambda word : (word,1)).collect() [(u'hadoop', 1), (u'is', 1), (u'fast', 1), (u'hive', 1), (u'is', 1), (u'sql', 1), (u'on', 1), (u'hdfs', 1), (u'spark', 1), (u'is', 1), (u'superfast', 1), (u'spark', 1), (u'is', 1), (u'awesome', 1)]
Whereas flatMap
on RDD wc
will give the below undesired output:
>>> wc.flatMap(lambda word : (word,1)).collect() [[u'hadoop', u'is', u'fast'], 1, [u'hive', u'is', u'sql', u'on', u'hdfs'], 1, [u'spark', u'is', u'superfast'], 1, [u'spark', u'is', u'awesome'], 1]
You can't get the word count if map
is used instead of flatMap
.
As per the definition, difference between map
and flatMap
is:
map
: It returns a new RDD by applying given function to each element of the RDD. Function inmap
returns only one item.
flatMap
: Similar tomap
, it returns a new RDD by applying a function to each element of the RDD, but output is flattened.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With