 

map vs mapValues in Spark

I'm currently learning Spark and developing custom machine learning algorithms. My question is: what is the difference between .map() and .mapValues(), and in which cases do I clearly have to use one instead of the other?

asked Apr 18 '16 by jtitusj

People also ask

What does mapValues do in PySpark?

Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.
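For illustration, a minimal sketch (in Scala, to match the answers below; the same idea applies in PySpark, and sc is assumed to be an active SparkContext):

val scores = sc.parallelize(Seq(("alice", 80), ("bob", 90)))
val curved = scores.mapValues(_ + 5)   // keys are untouched and partitioning is kept
// curved.collect => Array((alice,85), (bob,95))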

What is RDD map()?

PySpark's map() is an RDD transformation that applies a function (a lambda, for example) to every element of the RDD and returns a new RDD.
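A minimal sketch of map() over an RDD, again in Scala for consistency with the answers below (sc is assumed to be an active SparkContext):

val nums = sc.parallelize(Seq(1, 2, 3))
val squared = nums.map(n => n * n)   // the function runs on every element
// squared.collect => Array(1, 4, 9)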

Is map a transformation in Spark?

map is a transformation operation in Apache Spark. It applies a function to each element of an RDD and returns the result as a new RDD.

What is Spark's reduceByKey?

Spark's RDD reduceByKey() transformation merges the values for each key using an associative reduce function. It is a wide transformation, as it shuffles data across partitions, and it operates on pair RDDs (key/value pairs).
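For example (a sketch, assuming an active SparkContext sc):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val sums = pairs.reduceByKey(_ + _)   // associative merge of values per key
// sums.collect => Array((a,4), (b,2))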


What is the difference between map() and mapPartitions() in Spark?

mapPartitions() is similar to map(); the difference is that mapPartitions() lets you perform heavy initialization (for example, a database connection) once per partition instead of once per row.
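A sketch of that pattern; createConnection and lookup here are hypothetical helpers standing in for your own setup and per-record work:

val enriched = rdd.mapPartitions { iter =>
  val conn = createConnection()            // hypothetical: heavy setup runs once per partition
  iter.map(record => lookup(conn, record)) // hypothetical: reuses conn for every record
}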

What is the difference between map() and flatMap() transformations in Spark?

Both map() and flatMap() return a Dataset (DataFrame = Dataset[Row]). Both transformations are narrow, meaning they do not cause a Spark data shuffle. flatMap() can produce redundant data in some columns; one of its common use cases is flattening a column that contains arrays, lists, or any nested collection (one value per cell).
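For example, a minimal flatMap() sketch over an RDD (assuming an active SparkContext sc):

val lines = sc.parallelize(Seq("a b", "c"))
val words = lines.flatMap(_.split(" "))   // one output element per word
// words.collect => Array(a, b, c)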


2 Answers

mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (tuple of key and value).

In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical (almost - see comment at the bottom):

val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }
val result: RDD[(A, C)] = rdd.mapValues(f)

The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.

On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.
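For instance, reusing the notation above (the second line additionally assumes String keys, purely for illustration):

val combined: RDD[C] = rdd.map { case (k, v) => f(k, v) }       // f needs both parts of the record
val upperKeys = rdd.map { case (k, v) => (k.toUpperCase, v) }   // rewriting the key itself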

The last difference concerns partitioning: if you applied a custom partitioner to your RDD (e.g. using partitionBy), using map would "forget" that partitioner (the result reverts to the default partitioning), since the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.
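This is easy to see in a sketch (the partition count of 4 is arbitrary):

import org.apache.spark.HashPartitioner
val partitioned = rdd.partitionBy(new HashPartitioner(4))
partitioned.mapValues(f).partitioner                        // Some(HashPartitioner) -- preserved
partitioned.map { case (k, v) => (k, f(v)) }.partitioner    // None -- "forgotten"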

answered Sep 30 '22 by Tzach Zohar


When we use map() with a pair RDD, we get access to both the key and the value. Sometimes we are only interested in the value (and not the key). In those cases, we can use mapValues() instead of map().

Example of mapValues

val inputrdd = sc.parallelize(Seq(("maths", 50), ("maths", 60), ("english", 65)))
val mapped = inputrdd.mapValues(mark => (mark, 1))
val reduced = mapped.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
reduced.collect

Array[(String, (Int, Int))] = Array((english,(65,1)), (maths,(110,2)))

val average = reduced.map { x =>
  val temp = x._2
  val total = temp._1
  val count = temp._2
  (x._1, total / count)
}
average.collect()

res1: Array[(String, Int)] = Array((english,65), (maths,55))

answered Sep 30 '22 by vaquar khan