
Rolling your own reduceByKey in Spark Dataset

I'm trying to learn to use DataFrames and Datasets more, in addition to RDDs. For an RDD, I know I can do someRDD.reduceByKey((x,y) => x + y), but I don't see that function for Dataset. So I decided to write one.

someRdd
  .map(x => ((x.fromId, x.toId), 1))
  .map(x => collection.mutable.Map(x))
  .reduce((x, y) => {
    val result = mutable.HashMap.empty[(Long, Long), Int]
    val keys = mutable.HashSet.empty[(Long, Long)]
    y.keys.foreach(z => keys += z)
    x.keys.foreach(z => keys += z)
    for (elem <- keys) {
      val s1 = if (x.contains(elem)) x(elem) else 0
      val s2 = if (y.contains(elem)) y(elem) else 0
      result(elem) = s1 + s2
    }
    result
  })

However, this returns everything to the driver. How would you write this so that it returns a Dataset? Maybe use mapPartitions and do it there?

Note: this compiles but does not run, because there are no encoders for Map yet.

asked Jul 14 '16 by Carlos Bribiescas


People also ask

Can we use reduceByKey in spark Dataframe?

reduceByKey is not available on a single-value or regular RDD, only on a pair RDD (an RDD[(K, V)]). It is not defined on a DataFrame either, so there you would drop down to the underlying RDD or use groupBy with an aggregation instead.

How does reduceByKey work in spark?

In Spark, the reduceByKey function is a frequently used transformation that performs aggregation of data. It receives key-value pairs (K, V) as input, aggregates the values by key, and generates a dataset of (K, V) pairs as output.
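For example, a minimal sketch of this behaviour (assuming an existing SparkContext sc):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Partial sums are computed inside each partition before the shuffle,
// then merged across partitions.
val sums = pairs.reduceByKey(_ + _)

sums.collect()  // Array(("a", 4), ("b", 2)) -- order may vary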

What is the difference between groupByKey and reduceByKey explain by using the suitable coding example?

groupByKey can cause out-of-disk problems because all of the values are sent over the network and collected on the reducer workers. With reduceByKey, data is combined at each partition, so only one output per key per partition is sent over the network. The trade-off is that reduceByKey requires combining your values into another value of the exact same type.
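A minimal sketch contrasting the two on a word count (assuming an existing SparkContext sc):

val pairs = sc.parallelize(Seq("spark", "rdd", "spark")).map(w => (w, 1))

// groupByKey ships every (word, 1) pair across the network, then sums per key.
val countsGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey sums within each partition first, so only one partial
// count per key per partition is shuffled.
val countsReduce = pairs.reduceByKey(_ + _)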

Can we convert dataset to RDD?

A Dataset is a strongly typed DataFrame, so both Dataset and DataFrame can use .rdd to convert to an RDD.
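A minimal sketch (assuming an existing SparkSession spark and a hypothetical case class Person):

case class Person(name: String, age: Int)
import spark.implicits._

val ds = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

val fromDs = ds.rdd          // RDD[Person]
val fromDf = ds.toDF().rdd   // RDD[org.apache.spark.sql.Row]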


2 Answers

I assume your goal is to translate this idiom to Datasets:

rdd.map(x => (x.someKey, x.someField))
   .reduceByKey(_ + _)
// => returning an RDD of (KeyType, FieldType)

Currently, the closest solution I have found with the Dataset API looks like this:

ds.map(x => (x.someKey, x.someField))          // [1]
  .groupByKey(_._1)
  .reduceGroups((a, b) => (a._1, a._2 + b._2))
  .map(_._2)                                   // [2]
// => returning a Dataset of (KeyType, FieldType)

// Comments:
// [1] As far as I can see, having a map before groupByKey is required
//     to end up with the proper type in reduceGroups. After all, we do
//     not want to reduce over the original type, but the FieldType.
// [2] required since reduceGroups converts back to Dataset[(K, V)]
//     not knowing that our V's are already key-value pairs.
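To connect this back to the question, here is a rough usage sketch with a hypothetical case class Edge(fromId: Long, toId: Long), counting occurrences of each (fromId, toId) pair (assuming an existing SparkSession spark):

import org.apache.spark.sql.Dataset
import spark.implicits._

case class Edge(fromId: Long, toId: Long)

val edges: Dataset[Edge] = Seq(Edge(1L, 2L), Edge(1L, 2L), Edge(3L, 4L)).toDS()

// Same pattern as above: pair up, group by the key part, reduce, unwrap.
val counts: Dataset[((Long, Long), Int)] =
  edges.map(e => ((e.fromId, e.toId), 1))
       .groupByKey(_._1)
       .reduceGroups((a, b) => (a._1, a._2 + b._2))
       .map(_._2)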

This doesn't look very elegant, and according to a quick benchmark it is also much less performant, so maybe we are missing something here...

Note: An alternative might be to use groupByKey(_.someKey) as a first step. The problem is that using groupByKey changes the type from a regular Dataset to a KeyValueGroupedDataset. The latter does not have a regular map function. Instead it offers mapGroups, which does not seem very convenient because it wraps the values in an Iterator and, according to the docstring, performs a shuffle. A rough sketch of that variant follows.
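For completeness, a minimal sketch of that mapGroups variant, reusing the hypothetical edges Dataset from above; it skips the initial map, but each group's values arrive as an Iterator:

// Counts per (fromId, toId) via groupByKey + mapGroups.
val countsViaMapGroups: Dataset[((Long, Long), Int)] =
  edges.groupByKey(e => (e.fromId, e.toId))
       .mapGroups((key, values) => (key, values.size))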

answered Sep 21 '22 by bluenote10


A more efficient solution uses mapPartitions before groupByKey to reduce the amount of shuffling (note that this does not have exactly the same signature as reduceByKey, but I think it is more flexible to pass a function than to require that the dataset consist of tuples).

import scala.reflect.ClassTag
import org.apache.spark.sql.{Dataset, Encoder}

def reduceByKey[V: ClassTag, K](ds: Dataset[V], f: V => K, g: (V, V) => V)
    (implicit encK: Encoder[K], encV: Encoder[V]): Dataset[(K, V)] = {
  // Pre-aggregate within each partition before the shuffle.
  def h[V: ClassTag, K](f: V => K, g: (V, V) => V, iter: Iterator[V]): Iterator[V] = {
    iter.toArray.groupBy(f).mapValues(_.reduce(g)).map(_._2).toIterator
  }
  ds.mapPartitions(h(f, g, _))
    .groupByKey(f)(encK)
    .reduceGroups(g)
}
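As a rough usage sketch, with a hypothetical weighted-edge case class (assuming an existing SparkSession spark), summing weights per (fromId, toId) pair:

case class WeightedEdge(fromId: Long, toId: Long, weight: Int)
import spark.implicits._

val weighted = Seq(
  WeightedEdge(1L, 2L, 1),
  WeightedEdge(1L, 2L, 3),
  WeightedEdge(3L, 4L, 1)).toDS()

// Result is a Dataset[((Long, Long), WeightedEdge)] whose values carry the summed weight.
val summed = reduceByKey[WeightedEdge, (Long, Long)](
  weighted,
  e => (e.fromId, e.toId),
  (a, b) => a.copy(weight = a.weight + b.weight))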

Depending on the shape and size of your data, this is within one second of the performance of reduceByKey, and about 2x as fast as groupByKey(_._1).reduceGroups. There is still room for improvement, so suggestions would be welcome.

answered Sep 19 '22 by Justin Raymond