
Access key from mapValues or flatMapValues?

In Spark 1.3, is there a way to access the key from mapValues?

Specifically, if I have

val y = x.groupBy(someKey)
val z = y.mapValues(someFun)

can someFun know which key of y it is currently operating on?

Or do I have to do

val y = x.map(r => (someKey(r), r)).groupBy(_._1)
val z = y.mapValues { rs => someFun(rs.map(_._2), rs.head._1) }

Note: the reason I want to use mapValues rather than map is to preserve the partitioning.

mitchus asked Jun 15 '15 11:06



2 Answers

In this case you can use mapPartitions with the preservesPartitioning parameter set to true, applied to the grouped RDD y from the question.

val z = y.mapPartitions((it => it.map { case (k, rr) => (k, someFun(rr, k)) }), preservesPartitioning = true)

You just have to make sure you are not changing the partitioning, i.e. don't change the key.
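
If you need this in several places, the pattern above can be wrapped in a small helper. This is only a minimal sketch: mapValuesWithKey is a hypothetical name (it is not part of Spark), and the local master, app name and sample data are assumptions made so the example is self-contained.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MapValuesWithKeyDemo {
  // Hypothetical helper, not part of Spark: behaves like mapValues,
  // except that f also receives the key. The key is passed through
  // unchanged, so claiming preservesPartitioning = true is safe here.
  def mapValuesWithKey[K, V, U](rdd: RDD[(K, V)])(f: (K, V) => U): RDD[(K, U)] =
    rdd.mapPartitions(
      it => it.map { case (k, v) => (k, f(k, v)) },
      preservesPartitioning = true)

  def main(args: Array[String]): Unit = {
    // Assumed local setup, for illustration only.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("map-values-with-key"))
    try {
      val x = sc.parallelize(Seq("apple", "avocado", "banana", "cherry"))
      val y = x.groupBy(_.head) // RDD[(Char, Iterable[String])]
      val z = mapValuesWithKey(y)((k, words) => s"$k:${words.size}")

      // The partitioner set by groupBy should carry over to z.
      println(z.partitioner == y.partitioner)
      z.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because the helper never lets the caller touch the key, the "don't change the key" rule is enforced by its signature rather than by convention.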

Marius Soutier answered Sep 23 '22 05:09


You can't access the key from mapValues, but you can preserve the partitioning by using mapPartitions with preservesPartitioning = true.

import org.apache.spark.rdd.RDD

val pairs: RDD[(Int, Int)] = ???
pairs.mapPartitions({ it =>
  it.map { case (k, v) =>
    // your code; return a pair that keeps the key k unchanged
    (k, v)
  }
}, preservesPartitioning = true)

Be careful to actually preserve the partitioning: the compiler cannot check it.
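
To make that warning concrete, here is a rough sketch of what can go wrong when the flag is set but the keys are changed anyway. The local master, the sample data and the choice of HashPartitioner(4) are assumptions for illustration. It relies on lookup consulting the RDD's partitioner to decide which single partition to scan.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitioningCheckDemo {
  def main(args: Array[String]): Unit = {
    // Assumed local setup, for illustration only.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("partitioning-check"))
    try {
      val pairs = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))
        .partitionBy(new HashPartitioner(4))

      // Keys untouched: the preservesPartitioning claim is true.
      val good = pairs.mapPartitions(
        it => it.map { case (k, v) => (k, s"$k-$v") },
        preservesPartitioning = true)

      // Keys changed, yet the flag still claims the old partitioning.
      // This compiles fine; Spark takes the flag on trust.
      val bad = pairs.mapPartitions(
        it => it.map { case (k, v) => (k + 1, v) },
        preservesPartitioning = true)

      // Both RDDs report the original partitioner.
      println(good.partitioner == pairs.partitioner)
      println(bad.partitioner == pairs.partitioner)

      // lookup scans only the partition the partitioner points to.
      println(good.lookup(2))
      println(bad.collect().toSeq) // contains a record with key 2 ...
      println(bad.lookup(2))       // ... which lookup may fail to find
    } finally {
      sc.stop()
    }
  }
}
```

With this data I would expect bad.lookup(2) to come back empty even though collect shows a record with key 2, because that record still sits in the partition that belonged to key 1. Nothing fails at compile time or at run time; the result is just silently wrong.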

Lomig Mégard answered Sep 23 '22 05:09