I have a wide DataFrame (130,000 rows × 8,700 columns), and when I try to sum all columns I'm getting the following error:
Exception in thread "main" java.lang.StackOverflowError
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
    at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
    at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
    at scala.collection.generic.GenericCompanion.apply(GenericCompanion.scala:49)
    at org.apache.spark.sql.catalyst.expressions.BinaryExpression.children(Expression.scala:400)
    at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild$lzycompute(TreeNode.scala:88)
    ...
This is my Scala code:
val df = spark.read
  .option("header", "false")
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .csv("D:\\Documents\\Trabajo\\Fábregas\\matrizLuna\\matrizRelativa")

val arrayList = df.drop("cups").columns
var colsList = List[Column]()
arrayList.foreach { c => colsList :+= col(c) }
val df_suma = df.withColumn("consumo_total", colsList.reduce(_ + _))
If I do the same with a few columns it works fine, but I always get the same error when I try the reduce operation with a large number of columns.
Can anyone suggest how I can do this? Is there any limitation on the number of columns?
Thx!
You can use a different reduction method that produces a balanced binary tree of depth O(log(n)) instead of a degenerate linearized BinaryExpression chain of depth O(n):
def balancedReduce[X](list: List[X])(op: (X, X) => X): X = list match {
  case Nil => throw new IllegalArgumentException("Cannot reduce empty list")
  case List(x) => x
  case xs =>
    // Split the list in half and reduce each half recursively, so that the
    // resulting expression tree is balanced instead of a left-leaning chain.
    val n = xs.size
    val (as, bs) = xs.splitAt(n / 2)
    op(balancedReduce(as)(op), balancedReduce(bs)(op))
}
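A quick sanity check (the integer inputs here are made up, purely for illustration) that it agrees with a plain reduce:

val xs = (1 to 10).toList
assert(balancedReduce(xs)(_ + _) == xs.reduce(_ + _)) // both are 55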
Now, in your code, you can replace colsList.reduce(_ + _) with balancedReduce(colsList)(_ + _).
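In context, the last line of the snippet from the question becomes (a sketch, assuming df and colsList are built exactly as before):

val df_suma = df.withColumn("consumo_total", balancedReduce(colsList)(_ + _))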
A little example to further illustrate what happens with the BinaryExpressions, compilable without any dependencies:
sealed trait FormalExpr
case class BinOp(left: FormalExpr, right: FormalExpr) extends FormalExpr {
  // Pretty-print the tree, indenting each nesting level by one space.
  override def toString: String = {
    val lStr = left.toString.split("\n").map(" " + _).mkString("\n")
    val rStr = right.toString.split("\n").map(" " + _).mkString("\n")
    s"BinOp(\n${lStr}\n${rStr}\n)"
  }
}
case object Leaf extends FormalExpr

val leafs = List.fill[FormalExpr](16){Leaf}
println(leafs.reduce(BinOp(_, _)))
println(balancedReduce(leafs)(BinOp(_, _)))
This is what the ordinary reduce does (and this is essentially what happens in your code):
BinOp(
 BinOp(
  BinOp(
   BinOp(
    BinOp(
     BinOp(
      BinOp(
       BinOp(
        BinOp(
         BinOp(
          BinOp(
           BinOp(
            BinOp(
             BinOp(
              BinOp(
               Leaf
               Leaf
              )
              Leaf
             )
             Leaf
            )
            Leaf
           )
           Leaf
          )
          Leaf
         )
         Leaf
        )
        Leaf
       )
       Leaf
      )
      Leaf
     )
     Leaf
    )
    Leaf
   )
   Leaf
  )
  Leaf
 )
 Leaf
)
This is what balancedReduce produces:
BinOp(
 BinOp(
  BinOp(
   BinOp(
    Leaf
    Leaf
   )
   BinOp(
    Leaf
    Leaf
   )
  )
  BinOp(
   BinOp(
    Leaf
    Leaf
   )
   BinOp(
    Leaf
    Leaf
   )
  )
 )
 BinOp(
  BinOp(
   BinOp(
    Leaf
    Leaf
   )
   BinOp(
    Leaf
    Leaf
   )
  )
  BinOp(
   BinOp(
    Leaf
    Leaf
   )
   BinOp(
    Leaf
    Leaf
   )
  )
 )
)
The linearized chain is of length O(n), and when Catalyst tries to evaluate it, it blows the stack. This should not happen with the flat tree of depth O(log(n)).
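You can reproduce the same failure mode without Spark, on top of the FormalExpr example above. Here is a minimal sketch (the depth function and the chain size of 100000 are made up for illustration); Catalyst walks its expression trees recursively in much the same way:

// Hypothetical recursive traversal, analogous to Catalyst walking the tree.
def depth(e: FormalExpr): Int = e match {
  case Leaf => 0
  case BinOp(l, r) => 1 + math.max(depth(l), depth(r))
}

val bigLeafs = List.fill[FormalExpr](100000){Leaf}
// depth(bigLeafs.reduce(BinOp(_, _)))       // ~100000 nested calls: StackOverflowError
depth(balancedReduce(bigLeafs)(BinOp(_, _))) // ~17 nested calls: fine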
And while we are talking about asymptotic runtimes: why are you appending to colsList one column at a time? Each :+= on an immutable List is O(n), so building the list this way needs O(n^2) time. Why not simply call toList on the output of .columns?
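For example (a sketch reusing the setup from the question; the columns are strings, so they still have to be mapped through col):

import org.apache.spark.sql.functions.col

val colsList: List[Column] = df.drop("cups").columns.map(col).toList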