How to interpret RDD.treeAggregate

Tags: scala, distributed-computing, apache-spark, rdd

I ran into this line in the Apache Spark source code:

val (gradientSum, lossSum, miniBatchSize) = data
    .sample(false, miniBatchFraction, 42 + i)
    .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
      seqOp = (c, v) => {
        // c: (grad, loss, count), v: (label, features)
        val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
        (c._1, c._2 + l, c._3 + 1)
      },
      combOp = (c1, c2) => {
        // c: (grad, loss, count)
        (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
      }
    )

I have several problems reading this:

First, I can't find anything on the web that explains exactly how treeAggregate works or what its parameters mean.
Second, .treeAggregate seems to have two parameter lists, ()(), following the method name. What does that mean? Is this some special Scala syntax that I don't understand?
Finally, I see that both seqOp and combOp return a 3-element tuple that matches the variables on the left-hand side, but which one actually gets returned?

This statement must be really advanced. I can't begin to decipher it.

434

asked Apr 25 '15 03:04

bhomass

1 Answer

treeAggregate is a specialized implementation of aggregate that iteratively applies the combine function to subsets of partitions. This avoids returning all partial results to the driver, where a single-pass reduce would take place, as the classic aggregate does.

For all practical purposes, treeAggregate follows the same principle as aggregate, explained in this answer: "Explain the aggregate functionality in Python", except that it takes an extra parameter to indicate the depth of the partial aggregation.

Let me try to explain what's going on here specifically:

For aggregate, we need a zero value, a combiner function and a reduce function. aggregate uses currying to specify the zero value independently of the combine and reduce functions, which is why the call has two parameter lists: (zero)(seqOp, combOp).
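To make the two parameter lists and the roles of the two functions concrete, here is a minimal plain-Scala sketch that needs no Spark. The helper aggregateLike is hypothetical (it is not a Spark API) and models partitions as nested sequences; it only mirrors the shape of aggregate as described above.

```scala
// Plain-Scala stand-in for the shape of RDD.aggregate.
// `aggregateLike` is a hypothetical helper, not part of Spark.
object AggregateSketch {
  // Separate parameter lists: the data, then the zero value, then the two functions.
  def aggregateLike[T, U](partitions: Seq[Seq[T]])(zero: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U): U = {
    // seqOp folds the records of each partition into a partition-local accumulator.
    val partials: Seq[U] = partitions.map(_.foldLeft(zero)(seqOp))
    // combOp merges the per-partition accumulators; this merged value is what gets returned.
    partials.foldLeft(zero)(combOp)
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1.0, 2.0), Seq(3.0), Seq(4.0, 5.0))
    val (sum, count) = aggregateLike(partitions)((0.0, 0L))(
      seqOp = (c, v) => (c._1 + v, c._2 + 1),
      combOp = (c1, c2) => (c1._1 + c2._1, c1._2 + c2._2)
    )
    println(s"sum=$sum count=$count") // prints "sum=15.0 count=5"
  }
}
```

Both functions return the accumulator type, which is why they look alike: seqOp produces one accumulator per partition, and the value bound on the left-hand side is the result of merging those accumulators with combOp.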

We can then dissect the above call like this. Hopefully that helps understanding:

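// Assumed imports, matching the aliases used in Spark MLlib
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

val Zero: (BDV[Double], Double, Long) = (BDV.zeros[Double](n), 0.0, 0L)
// v is assumed to be (Double, Vector), i.e. (label, features), as in MLlib's gradient descent
val combinerFunction: ((BDV[Double], Double, Long), (Double, Vector)) => (BDV[Double], Double, Long) = (c, v) => {
        // c: (grad, loss, count), v: (label, features)
        val l = gradient.compute(v._2, v._1, bcWeights.value, Vectors.fromBreeze(c._1))
        (c._1, c._2 + l, c._3 + 1)
      }
val reducerFunction: ((BDV[Double], Double, Long), (BDV[Double], Double, Long)) => (BDV[Double], Double, Long) = (c1, c2) => {
        // c: (grad, loss, count)
        (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
      }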

Then we can rewrite the call to treeAggregate in a more digestible form:

val (gradientSum, lossSum, miniBatchSize) = data
    .sample(false, miniBatchFraction, 42 + i)
    .treeAggregate(Zero)(combinerFunction, reducerFunction)

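This form 'extracts' the resulting tuple into the named values gradientSum, lossSum and miniBatchSize for further use. That tuple is the final accumulator: the combiner function builds one accumulator per partition, and the reduce function merges them into the single value that is returned.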
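The 'extraction' is ordinary Scala pattern binding on a tuple. A small sketch, where result() is a hypothetical stand-in for the aggregation result and a plain Vector[Double] replaces the Breeze vector:

```scala
object TupleBinding {
  // Hypothetical stand-in for the (grad, loss, count) tuple returned by the aggregation.
  def result(): (Vector[Double], Double, Long) = (Vector(0.5, 1.5), 2.0, 4L)

  def main(args: Array[String]): Unit = {
    // Pattern binding: one val definition names all three tuple fields.
    val (gradientSum, lossSum, miniBatchSize) = result()

    // Equivalent form using positional accessors.
    val t = result()
    assert(gradientSum == t._1 && lossSum == t._2 && miniBatchSize == t._3)

    println(lossSum / miniBatchSize) // prints "0.5"
  }
}
```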

Note that treeAggregate takes an additional parameter, depth, which is declared with a default value of depth = 2. As it's not provided in this particular call, the default is used.
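To illustrate what depth controls, here is a simplified plain-Scala model of multi-level merging. The helper treeCombine is hypothetical, and its grouping strategy is an assumption made for illustration; it is not Spark's actual implementation, which shuffles partial results between executors.

```scala
// Simplified model of tree-style merging; `treeCombine` is a hypothetical helper.
object TreeAggregateSketch {
  def treeCombine[U](partials: Seq[U], depth: Int)(combOp: (U, U) => U): U = {
    require(partials.nonEmpty && depth >= 1)
    // Fan-in per level, chosen so that roughly `depth` levels are enough.
    val scale = math.max(math.ceil(math.pow(partials.size.toDouble, 1.0 / depth)).toInt, 2)
    var level: Seq[U] = partials
    while (level.size > scale) {
      // Each group models one intermediate merge that would run on an executor.
      level = level.grouped(scale).map(_.reduce(combOp)).toVector
    }
    // Final merge of the few remaining partial results (the driver's share of the work).
    level.reduce(combOp)
  }

  def main(args: Array[String]): Unit = {
    // 16 per-partition accumulators of the form (sum, count).
    val partials = (1 to 16).map(i => (i.toDouble, 1L))
    val (sum, count) = treeCombine(partials, depth = 2)((a, b) => (a._1 + b._1, a._2 + b._2))
    println(s"sum=$sum count=$count") // prints "sum=136.0 count=16"
  }
}
```

With 16 partial results and depth = 2, this sketch first merges them in 4 groups of 4 and then merges the 4 remaining values, so the final step handles 4 inputs instead of 16.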

146

answered Oct 15 '22 19:10

maasg
