
How to parallelize Spark scala computation?

I have code to compute Within Set Sum of Squared Error after clustering which I mostly took from the Spark mllib source code.

When I run the analogous code using the Spark API it runs as many separate (distributed) jobs and completes successfully. When I run my code (which should be doing the same thing as the Spark code) I get a StackOverflowError. Any ideas why?

Here is the code:

        import java.util.Arrays
        import org.apache.spark.mllib.linalg.{Vectors, Vector}
        import org.apache.spark.mllib.linalg._
        import org.apache.spark.mllib.linalg.distributed.RowMatrix
        import org.apache.spark.rdd.RDD
        import org.apache.spark.api.java.JavaRDD
        import breeze.linalg.{axpy => brzAxpy, inv, svd => brzSvd, DenseMatrix => BDM, DenseVector => BDV,
          MatrixSingularException, SparseVector => BSV, CSCMatrix => BSM, Matrix => BM}

        val EPSILON = {
            var eps = 1.0
            while ((1.0 + (eps / 2.0)) != 1.0) {
              eps /= 2.0
            }
            eps
          }

        def dot(x: Vector, y: Vector): Double = {
            require(x.size == y.size,
              "BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:" +
              " x.size = " + x.size + ", y.size = " + y.size)
            (x, y) match {
              case (dx: DenseVector, dy: DenseVector) =>
                dot(dx, dy)
              case (sx: SparseVector, dy: DenseVector) =>
                dot(sx, dy)
              case (dx: DenseVector, sy: SparseVector) =>
                dot(sy, dx)
              case (sx: SparseVector, sy: SparseVector) =>
                dot(sx, sy)
              case _ =>
                throw new IllegalArgumentException(s"dot doesn't support (${x.getClass}, ${y.getClass}).")
            }
         }

         def fastSquaredDistance(
              v1: Vector,
              norm1: Double,
              v2: Vector,
              norm2: Double,
              precision: Double = 1e-6): Double = {
            val n = v1.size
            require(v2.size == n)
            require(norm1 >= 0.0 && norm2 >= 0.0)
            val sumSquaredNorm = norm1 * norm1 + norm2 * norm2
            val normDiff = norm1 - norm2
            var sqDist = 0.0
            /*
             * The relative error is
             * <pre>
             * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 + 2 |a^T b| ) / ( \|a - b\|_2^2 ),
             * </pre>
             * which is bounded by
             * <pre>
             * 2.0 * EPSILON * ( \|a\|_2^2 + \|b\|_2^2 ) / ( (\|a\|_2 - \|b\|_2)^2 ).
             * </pre>
             * The bound doesn't need the inner product, so we can use it as a sufficient condition to
             * check quickly whether the inner product approach is accurate.
             */
            val precisionBound1 = 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON)
            if (precisionBound1 < precision) {
              sqDist = sumSquaredNorm - 2.0 * dot(v1, v2)
            } else if (v1.isInstanceOf[SparseVector] || v2.isInstanceOf[SparseVector]) {
              val dotValue = dot(v1, v2)
              sqDist = math.max(sumSquaredNorm - 2.0 * dotValue, 0.0)
              val precisionBound2 = EPSILON * (sumSquaredNorm + 2.0 * math.abs(dotValue)) /
                (sqDist + EPSILON)
              if (precisionBound2 > precision) {
                sqDist = Vectors.sqdist(v1, v2)
              }
            } else {
              sqDist = Vectors.sqdist(v1, v2)
            }
            sqDist
        }

        def findClosest(
              centers: TraversableOnce[Vector],
              point: Vector): (Int, Double) = {
            var bestDistance = Double.PositiveInfinity
            var bestIndex = 0
            var i = 0
            centers.foreach { center =>
              // Since `\|a - b\| \geq |\|a\| - \|b\||`, we can use this lower bound to avoid unnecessary
              // distance computation.
              var lowerBoundOfSqDist = Vectors.norm(center, 2.0) - Vectors.norm(point, 2.0)
              lowerBoundOfSqDist = lowerBoundOfSqDist * lowerBoundOfSqDist
              if (lowerBoundOfSqDist < bestDistance) {
                val distance: Double = fastSquaredDistance(center, Vectors.norm(center, 2.0), point, Vectors.norm(point, 2.0))
                if (distance < bestDistance) {
                  bestDistance = distance
                  bestIndex = i
                }
              }
              i += 1
            }
            (bestIndex, bestDistance)
        }

         def pointCost(
              centers: TraversableOnce[Vector],
              point: Vector): Double =
            findClosest(centers, point)._2



        def clusterCentersIter: Iterable[Vector] =
            clusterCenters.map(p => p)


        def computeCostZep(indata: RDD[Vector]): Double = {
            val bcCenters = indata.context.broadcast(clusterCenters)
            indata.map(p => pointCost(bcCenters.value, p)).sum()
          }

        computeCostZep(projectedData)

I believe I am using the same parallelization as Spark, but it doesn't work for me. Any advice on making my code distributed, or help seeing why the stack overflow happens in my code, would be very welcome.

Here are links to the very similar source code in Spark: KMeansModel and KMeans.

and this is the code which does run fine:

val clusters = KMeans.train(projectedData, numClusters, numIterations)

val clusterCenters = clusters.clusterCenters

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(projectedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

Here is the error output:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 94.0 failed 4 times, most recent failure: Lost task 1.3 in stage 94.0 (TID 37663, ip-172-31-13-209.ec2.internal): java.lang.StackOverflowError at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$c57ec8bf9b0d5f6161b97741d596ff0$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.dot(:226) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$c57ec8bf9b0d5f6161b97741d596ff0$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.dot(:226) ...

and later down:

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952) at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1088) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at org.apache.spark.rdd.RDD.fold(RDD.scala:1082) at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply$mcD$sp(DoubleRDDFunctions.scala:34) at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:34) at org.apache.spark.rdd.DoubleRDDFunctions$$anonfun$sum$1.apply(DoubleRDDFunctions.scala:34) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at org.apache.spark.rdd.DoubleRDDFunctions.sum(DoubleRDDFunctions.scala:33)

asked May 29 '16 by user3494047

People also ask

What is the parallelize method in Spark?

When Spark's parallelize method is applied to a collection of elements, a new distributed dataset is created with the specified number of partitions, and the elements of the collection are copied into that distributed dataset (RDD). The syntax of SparkContext's parallelize() method is sketched below.
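
A minimal sketch, assuming the SparkContext named sc that spark-shell provides (the values are only illustrative):

    // copy a driver-side collection into a distributed RDD
    val data = Seq(1.0, 2.0, 3.0, 4.0, 5.0)
    val rdd = sc.parallelize(data)
    println(rdd.getNumPartitions) // partition count chosen by Spark's defaults
    println(rdd.sum())            // 15.0, computed across the partitions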

How do I parallelize from a SparkSession in Scala?

If you are using Scala, get the SparkContext object from the SparkSession and use sparkContext.parallelize() to create the RDD. This function also has a second signature that additionally takes an integer argument to specify the number of partitions. Partitions are the basic units of parallelism in Apache Spark.
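
For instance, a sketch assuming Spark 2.x or later, where SparkSession is the entry point (the app name and master are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("parallelize-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // the second argument is the optional number of partitions
    val rdd = sc.parallelize(1 to 100, 4)
    println(rdd.getNumPartitions) // 4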

What is a parallelized RDD in Spark?

parallelize is one of the three methods of creating an RDD in Spark, the other two being: (1) reading from an external data source such as a local filesystem, HDFS, or Cassandra; (2) running a transformation on an existing RDD.
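
A short sketch of all three routes, assuming a SparkContext sc; the HDFS path is purely hypothetical:

    val fromCollection = sc.parallelize(Seq(1, 2, 3))           // 1. from a driver-side collection
    val fromFile       = sc.textFile("hdfs:///data/points.txt") // 2. from an external data source
    val fromTransform  = fromCollection.map(_ * 2)              // 3. by transforming an existing RDD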

What is mandatory and what is optional in Spark's parallelize()?

The collection of items to parallelize is mandatory. An integer giving the number of partitions the data should be split into is optional. Spark's parallelize() method creates N partitions if N is specified; otherwise Spark chooses N based on the cluster the driver program is running on.
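
In code, a sketch of both variants (again assuming a SparkContext sc):

    val defaultPartitions  = sc.parallelize(1 to 10)    // N picked by Spark (e.g. from spark.default.parallelism)
    val explicitPartitions = sc.parallelize(1 to 10, 8) // N = 8 as requested
    println(defaultPartitions.getNumPartitions)
    println(explicitPartitions.getNumPartitions)        // 8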


1 Answer

Seems pretty straightforward what is happening: you are recursively invoking the dot method here:

    def dot(x: Vector, y: Vector): Double = {
        require(x.size == y.size,
          "BLAS.dot(x: Vector, y:Vector) was given Vectors with non-matching sizes:" +
          " x.size = " + x.size + ", y.size = " + y.size)
        (x, y) match {
          case (dx: DenseVector, dy: DenseVector) =>
            dot(dx, dy)
          case (sx: SparseVector, dy: DenseVector) =>
            dot(sx, dy)
          case (dx: DenseVector, sy: SparseVector) =>
            dot(sy, dx)
          case (sx: SparseVector, sy: SparseVector) =>
            dot(sx, sy)
          case _ =>
            throw new IllegalArgumentException(s"dot doesn't support (${x.getClass}, ${y.getClass}).")
        }
     }

Each successive recursive invocation of dot is made with exactly the same arguments as the previous one, so the recursion never terminates.

The stack trace tells you that as well: notice that the reported location is the dot method:

java.lang.StackOverflowError at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$$$$$c57ec8bf9b0d5f6161b97741d596ff0$$$$wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.dot(:226) at
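
The underlying reason is that the snippet defines only the single dot(x: Vector, y: Vector) method, so the typed calls inside the match (dot(dx, dy) and friends) resolve straight back to that same method rather than to the specialised overloads that Spark's own (private) BLAS object provides. One way out is to compute the products directly instead of re-dispatching. A minimal non-recursive sketch (not Spark's actual BLAS implementation) could look like this:

    def dot(x: Vector, y: Vector): Double = {
      require(x.size == y.size,
        s"dot was given Vectors with non-matching sizes: x.size = ${x.size}, y.size = ${y.size}")
      (x, y) match {
        case (dx: DenseVector, dy: DenseVector) =>
          // plain dense dot product, no further dispatch
          var sum = 0.0
          var i = 0
          while (i < dx.size) { sum += dx(i) * dy(i); i += 1 }
          sum
        case (sx: SparseVector, dy: DenseVector) =>
          // only the active entries of the sparse vector contribute
          sx.indices.zip(sx.values).map { case (i, v) => v * dy(i) }.sum
        case (dx: DenseVector, sy: SparseVector) =>
          sy.indices.zip(sy.values).map { case (i, v) => v * dx(i) }.sum
        case (sx: SparseVector, sy: SparseVector) =>
          // look up the second vector's active entries by index
          val yByIndex = sy.indices.zip(sy.values).toMap
          sx.indices.zip(sx.values).map { case (i, v) => v * yByIndex.getOrElse(i, 0.0) }.sum
        case _ =>
          throw new IllegalArgumentException(s"dot doesn't support (${x.getClass}, ${y.getClass}).")
      }
    }

Alternatively, since KMeansModel.computeCost already performs this whole computation in a distributed way (as the working snippet in the question shows), reusing it sidesteps the problem entirely.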

answered Oct 26 '22 by WestCoastProjects