
What is the difference between map and flatMap and a good use case for each?

Tags:

apache-spark

Can someone explain to me the difference between map and flatMap and what is a good use case for each?

What does "flatten the results" mean? What is it good for?

asked Mar 12 '14 by Eran Witkon



2 Answers

Here is an example of the difference, as a spark-shell session:

First, some data - two lines of text:

val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue"))  // lines

rdd.collect
    res0: Array[String] = Array("Roses are red", "Violets are blue")

Now, map transforms an RDD of length N into another RDD of length N.

For example, it maps from two lines into two line-lengths:

rdd.map(_.length).collect
    res1: Array[Int] = Array(13, 16)

But flatMap (loosely speaking) transforms an RDD of length N into a collection of N collections, then flattens these into a single RDD of results.

rdd.flatMap(_.split(" ")).collect
    res2: Array[String] = Array("Roses", "are", "red", "Violets", "are", "blue")

We have multiple words per line, and multiple lines, but we end up with a single output array of words.

Just to illustrate that, flatMapping from a collection of lines to a collection of words looks like:

["aa bb cc", "", "dd"] => [["aa","bb","cc"],[],["dd"]] => ["aa","bb","cc","dd"] 

The input and output RDDs will therefore typically be of different sizes for flatMap.
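The same map-vs-flatMap semantics can be sketched without Spark, on plain Python lists (an illustrative stand-in, not Spark's API — a list comprehension with two `for` clauses plays the role of the flatten step):

```python
# A sketch, not Spark: map keeps exactly one output per input,
# while flatMap applies the function and then flattens one level.
lines = ["aa bb cc", "", "dd"]

mapped = [line.split() for line in lines]                   # N lines -> N lists
flat_mapped = [w for line in lines for w in line.split()]   # N lines -> M words

print(mapped)       # [['aa', 'bb', 'cc'], [], ['dd']]
print(flat_mapped)  # ['aa', 'bb', 'cc', 'dd']
```

Note how the empty line contributes an empty list under map, but simply vanishes under the flattened version — which is why the output size can differ from the input size.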

If we had tried to use map with our split function, we'd have ended up with nested structures (an RDD of arrays of words, with type RDD[Array[String]]) because we have to have exactly one result per input:

rdd.map(_.split(" ")).collect
    res3: Array[Array[String]] = Array(
                                     Array(Roses, are, red),
                                     Array(Violets, are, blue)
                                 )

Finally, one useful special case is mapping with a function which might not return an answer, and so returns an Option. We can use flatMap to filter out the elements that return None and extract the values from those that return a Some:

val rdd = sc.parallelize(Seq(1, 2, 3, 4))

def myfn(x: Int): Option[Int] = if (x <= 2) Some(x * 10) else None

rdd.flatMap(myfn).collect
    res3: Array[Int] = Array(10, 20)

(noting here that an Option behaves rather like a list that has either one element, or zero elements)
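That "zero-or-one element" behaviour can be sketched without Spark or Scala's Option, by standing in an empty or one-element Python list (names here are illustrative):

```python
# Sketch of the Option pattern: [] plays the role of None,
# [v] plays the role of Some(v); flattening drops the empty cases.
def myfn(x):
    return [x * 10] if x <= 2 else []

data = [1, 2, 3, 4]
result = [y for x in data for y in myfn(x)]
print(result)  # [10, 20]
```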

answered Oct 07 '22 by DNA


Word count is the classic example in Hadoop. I will take the same use case, apply both map and flatMap, and compare how each processes the data.

Below is the sample data file.

hadoop is fast
hive is sql on hdfs
spark is superfast
spark is awesome

The above file will be parsed using map and flatMap.

Using map

>>> wc = data.map(lambda line: line.split(" "))
>>> wc.collect()
[[u'hadoop', u'is', u'fast'], [u'hive', u'is', u'sql', u'on', u'hdfs'],
 [u'spark', u'is', u'superfast'], [u'spark', u'is', u'awesome']]

Input has 4 lines, and the output also has 4 elements (one list of words per line), i.e., N elements ==> N elements.

Using flatMap

>>> fm = data.flatMap(lambda line: line.split(" "))
>>> fm.collect()
[u'hadoop', u'is', u'fast', u'hive', u'is', u'sql', u'on', u'hdfs',
 u'spark', u'is', u'superfast', u'spark', u'is', u'awesome']

The output differs from map: all 14 words from the 4 lines are flattened into a single list.


Let's assign 1 as value for each key to get the word count.

  • fm: RDD created by using flatMap
  • wc: RDD created using map
>>> fm.map(lambda word: (word, 1)).collect()
[(u'hadoop', 1), (u'is', 1), (u'fast', 1), (u'hive', 1), (u'is', 1),
 (u'sql', 1), (u'on', 1), (u'hdfs', 1), (u'spark', 1), (u'is', 1),
 (u'superfast', 1), (u'spark', 1), (u'is', 1), (u'awesome', 1)]

Whereas flatMap on RDD wc will give the below undesired output:

>>> wc.flatMap(lambda word: (word, 1)).collect()
[[u'hadoop', u'is', u'fast'], 1, [u'hive', u'is', u'sql', u'on', u'hdfs'], 1,
 [u'spark', u'is', u'superfast'], 1, [u'spark', u'is', u'awesome'], 1]

You can't get the word count if map is used instead of flatMap.
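In Spark the count would typically be finished with `fm.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)`. Without a cluster, the same flatten-then-aggregate pipeline can be sketched in plain Python (names here are illustrative, and `Counter` stands in for the reduceByKey step):

```python
from collections import Counter

lines = ["hadoop is fast",
         "hive is sql on hdfs",
         "spark is superfast",
         "spark is awesome"]

words = [w for line in lines for w in line.split(" ")]  # the flatMap step
counts = Counter(words)                                 # the reduceByKey step

print(counts["is"])     # 4
print(counts["spark"])  # 2
```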

As per the definitions, the difference between map and flatMap is:

map: returns a new RDD by applying the given function to each element of the RDD. The function in map returns exactly one item per input.

flatMap: similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened.

answered Oct 07 '22 by yoga