Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Distributed Map in Scala Spark

Does Spark support distributed Map collection types ?

So if I have an HashMap[String,String] which are key,value pairs , can this be converted to a distributed Map collection type ? To access the element I could use "filter" but I doubt this performs as well as Map ?

like image 906
blue-sky Avatar asked Jul 13 '14 16:07

blue-sky


People also ask

What is map in Spark Scala?

Spark map() is a transformation operation that is used to apply the transformation on every element of RDD, DataFrame, and Dataset and finally returns a new RDD/Dataset respectively. In this article, you will learn the syntax and usage of the map() transformation with an RDD & DataFrame example.

What are RDDs in Spark?

RDDs are the main logical data units in Spark. They are a distributed collection of objects, which are stored in memory or on disks of different machines of a cluster. A single RDD can be divided into multiple logical partitions so that these partitions can be stored and processed on different machines of a cluster.

What is the function of the map () in Spark?

Spark Map function takes one element as input process it according to custom code (specified by the developer) and returns one element at a time. Map transforms an RDD of length N into another RDD of length N. The input and output RDDs will typically have the same number of records.

How many ways can you create RDD in Spark?

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.


2 Answers

Since I found some new info I thought I'd turn my comments into an answer. @maasg already covered the standard lookup function I would like to point out you should be careful because if the RDD's partitioner is None, lookup just uses a filter anyway. In reference to the (K,V) store on top of spark it looks like this is in progress, but a usable pull request has been made here. Here is an example usage.

import org.apache.spark.rdd.IndexedRDD

// Create an RDD of key-value pairs with Long keys.
val rdd = sc.parallelize((1 to 1000000).map(x => (x.toLong, 0)))
// Construct an IndexedRDD from the pairs, hash-partitioning and indexing
// the entries.
val indexed = IndexedRDD(rdd).cache()

// Perform a point update.
val indexed2 = indexed.put(1234L, 10873).cache()
// Perform a point lookup. Note that the original IndexedRDD remains
// unmodified.
indexed2.get(1234L) // => Some(10873)
indexed.get(1234L) // => Some(0)

// Efficiently join derived IndexedRDD with original.
val indexed3 = indexed.innerJoin(indexed2) { (id, a, b) => b }.filter(_._2 != 0)
indexed3.collect // => Array((1234L, 10873))

// Perform insertions and deletions.
val indexed4 = indexed2.put(-100L, 111).delete(Array(998L, 999L)).cache()
indexed2.get(-100L) // => None
indexed4.get(-100L) // => Some(111)
indexed2.get(999L) // => Some(0)
indexed4.get(999L) // => None

It seems like the pull request was well received and will probably be included in future versions of spark, so it is probably safe to use that pull request in your own code. Here is the JIRA ticket in case you were curious

like image 54
aaronman Avatar answered Sep 22 '22 14:09

aaronman


The quick answer: Partially.

You can transform a Map[A,B] into an RDD[(A,B)] by first forcing the map into a sequence of (k,v) pairs but by doing so you loose the constrain that keys of a map must be a set. ie. you loose the semantics of the Map structure.

From a practical perspective, you can still resolve an element into its corresponding value using kvRdd.lookup(element) but the result will be a sequence, given that you have no warranties that there's a single lookup value as explained before.

A spark-shell example to make things clear:

val englishNumbers = Map(1 -> "one", 2 ->"two" , 3 -> "three")
val englishNumbersRdd = sc.parallelize(englishNumbers.toSeq)

englishNumbersRdd.lookup(1)
res: Seq[String] = WrappedArray(one) 

val spanishNumbers = Map(1 -> "uno", 2 -> "dos", 3 -> "tres")
val spanishNumbersRdd = sc.parallelize(spanishNumbers.toList)

val bilingueNumbersRdd = englishNumbersRdd union spanishNumbersRdd

bilingueNumbersRdd.lookup(1)
res: Seq[String] = WrappedArray(one, uno)
like image 23
maasg Avatar answered Sep 19 '22 14:09

maasg