How to compute cumulative sum using Spark

Tags: scala, apache-spark
I have an RDD of (String, Int) pairs, sorted by key:

val data = Array(("c1",6), ("c2",3),("c3",4))
val rdd = sc.parallelize(data).sortByKey

Now I want the value for the first key to be zero, and the value for each subsequent key to be the sum of the values of all previous keys.

E.g. c1 = 0, c2 = c1's value, c3 = (c1's value + c2's value), c4 = (c1 + ... + c3 values). Expected output:

(c1,0), (c2,6), (c3,9)...

Is it possible to achieve this? I tried it with map, but the sum is not preserved inside the map:

var sum = 0 ;
val t = keycount.map{ x => { val temp = sum; sum = sum + x._2 ; (x._1,temp); }}

asked Feb 02 '16 13:02

Knight71

1 Answer

Compute partial results for each partition:

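val partials = rdd.mapPartitionsWithIndex((i, iter) => {
  // Split the partition into its keys and values
  val (keys, values) = iter.toSeq.unzip
  // Running sums within this partition, starting from the seed 0
  val sums  = values.scanLeft(0)(_ + _)
  // sums.tail is the inclusive running total; sums.init would give the
  // exclusive form asked for in the question. sums.last is the partition total.
  Iterator((keys.zip(sums.tail), sums.last))
})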
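To make the per-partition step concrete, here is a small plain-Scala sketch (no Spark required) for one hypothetical partition holding ("c1",6) and ("c2",3). The object name `ScanLeftSketch` and the choice of partition contents are assumptions for illustration only. It shows the difference between `sums.tail` and `sums.init`:

```scala
object ScanLeftSketch {
  // One hypothetical partition of the sorted data.
  val partition: Seq[(String, Int)] = Seq(("c1", 6), ("c2", 3))

  val (keys, values) = partition.unzip

  // scanLeft emits the seed first, so the result has one more element than the input.
  val sums: Seq[Int] = values.scanLeft(0)(_ + _)

  // Running total including the current element.
  val inclusive: Seq[(String, Int)] = keys.zip(sums.tail)

  // Running total of the preceding elements only.
  val exclusive: Seq[(String, Int)] = keys.zip(sums.init)

  // Total of the whole partition, needed later as input for the offsets.
  val partitionTotal: Int = sums.last
}
```

Here `sums` is 0, 6, 9, so `inclusive` pairs c1 with 6 and c2 with 9, while `exclusive` pairs c1 with 0 and c2 with 6.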

Collect the partial sums:

val partialSums = partials.values.collect

Compute cumulative sum over partitions and broadcast it:

val sumMap = sc.broadcast(
  (0 until rdd.partitions.size)
    .zip(partialSums.scanLeft(0)(_ + _))
    .toMap
)
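As a worked example of the offset map, the following plain-Scala sketch assumes two partitions whose totals are 9 and 4 (for instance c1 and c2 in the first, c3 in the second). The object name `OffsetSketch` and these totals are hypothetical:

```scala
object OffsetSketch {
  // Hypothetical per-partition totals, in partition order.
  val partialSums: Array[Int] = Array(9, 4)

  // scanLeft yields 0, 9, 13. Zipping with the partition indices drops the
  // last element (the grand total), so each index maps to the sum of all
  // earlier partitions.
  val offsets: Map[Int, Int] =
    partialSums.indices.zip(partialSums.scanLeft(0)(_ + _)).toMap
}
```

Partition 0 gets offset 0 and partition 1 gets offset 9, which is the amount added to every local sum in that partition in the final step.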

Compute final results:

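val result = partials.keys.mapPartitionsWithIndex((i, iter) => {
  // Sum of all values in the partitions before this one
  val offset = sumMap.value(i)
  if (iter.isEmpty) Iterator()
  // Each partition of partials holds a single Seq of (key, local sum) pairs
  else iter.next().map{case (k, v) => (k, v + offset)}.toIterator
})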
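The whole two-pass idea can also be checked without a cluster. The sketch below is a plain-Scala model, not the Spark code itself: partitions are represented as nested sequences, and `CumSumSketch`, `Partition`, and `exclusiveCumSum` are names made up for illustration. It uses `sums.init`, so it produces the exclusive sums from the question rather than the inclusive ones.

```scala
object CumSumSketch {
  type Partition = Seq[(String, Int)]

  // Exclusive cumulative sum over data that is already sorted and
  // split into partitions.
  def exclusiveCumSum(partitions: Seq[Partition]): Seq[(String, Int)] = {
    // Pass 1: within each partition, pair every key with the sum of the
    // values before it, and keep the partition's total.
    val partials: Seq[(Partition, Int)] = partitions.map { part =>
      val (keys, values) = part.unzip
      val sums = values.scanLeft(0)(_ + _)
      (keys.zip(sums.init), sums.last)
    }

    // Offset of partition i = total of partitions 0 until i.
    val offsets: Seq[Int] = partials.map(_._2).scanLeft(0)(_ + _)

    // Pass 2: shift each partition's local sums by its offset.
    // zip ignores the extra final element of offsets (the grand total).
    partials.zip(offsets).flatMap { case ((local, _), offset) =>
      local.map { case (k, v) => (k, v + offset) }
    }
  }
}
```

Calling `CumSumSketch.exclusiveCumSum` on the question's data split as c1, c2 in one partition, an empty partition, and c3 in a third yields (c1,0), (c2,6), (c3,9). In the Spark version, `partials` is evaluated twice (once for the collect, once for the final step), so calling `partials.cache()` beforehand may avoid recomputation, depending on the data size.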

answered Sep 30 '22 02:09

zero323
