I'm currently learning Spark and developing custom machine learning algorithms. My question is: what is the difference between .map() and .mapValues(), and what are the cases where I clearly have to use one instead of the other?
The Spark API docs describe mapValues() as follows: "Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning."
mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (the tuple of key and value). In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical (almost - see the note on partitioning at the bottom):
val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }
val result: RDD[(A, C)] = rdd.mapValues(f)
The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.
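As a concrete instantiation of the equivalence above (a minimal sketch assuming a spark-shell session where sc is in scope, with A = String, B = Int, and f doubling the value):

val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))
// both yield an RDD[(String, Int)] containing ("a", 2) and ("b", 4)
val viaMap       = rdd.map { case (k, v) => (k, v * 2) }
val viaMapValues = rdd.mapValues(_ * 2)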
On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.
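For example (a sketch with made-up data), collapsing each record into a single value derived from both the key and the value has to go through map, because the function needs the whole tuple:

// f: (A, B) => C - needs the key as well, so mapValues cannot express it
val labels = sc.parallelize(Seq(("maths", 50), ("english", 65)))
  .map { case (subject, mark) => s"$subject scored $mark" }  // RDD[String]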
The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the result will revert to default partitioning) as the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.
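A quick way to observe this in a spark-shell session (a sketch; the 4-partition HashPartitioner is arbitrary):

import org.apache.spark.HashPartitioner

val partitioned = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(new HashPartitioner(4))
partitioned.mapValues(_ + 1).partitioner                    // Some(...) - partitioner preserved
partitioned.map { case (k, v) => (k, v + 1) }.partitioner   // None - partitioner "forgotten"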
When we use map() with a pair RDD, we get access to both the key and the value. Sometimes we are only interested in the value (and not the key). In those cases, we can use mapValues() instead of map().
Example of mapValues:

val inputrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65)))
val mapped = inputrdd.mapValues(mark => (mark, 1))                      // pair each mark with a count of 1
val reduced = mapped.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))  // sum marks and counts per key
reduced.collect
Array[(String, (Int, Int))] = Array((english,(65,1)), (maths,(110,2)))
val average = reduced.map { x =>
  val temp = x._2
  val total = temp._1
  val count = temp._2
  (x._1, total / count)
}
average.collect()
res1: Array[(String, Int)] = Array((english,65), (maths,55))
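Incidentally, since this last step leaves the keys untouched too, it can itself be written with mapValues (a sketch), which is shorter and also keeps the partitioner that reduceByKey set:

val average = reduced.mapValues { case (total, count) => total / count }
average.collect()  // Array((english,65), (maths,55))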