Apache Spark: StackOverflowError when trying to index string columns

I have a CSV file with about 5000 rows and 950 columns. First I load it into a DataFrame:

val data = sqlContext.read
  .format(csvFormat)
  .option("header", "true")
  .option("inferSchema", "true")
  .load(file)
  .cache()

After that I find all the string columns

val featuresToIndex = data.schema
  .filter(_.dataType == StringType)
  .map(field => field.name)

and want to index them. For that I create an indexer for each string column

val stringIndexers = featuresToIndex.map(colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "Indexed"))

and create a pipeline

val pipeline = new Pipeline().setStages(stringIndexers.toArray)

But when I try to transform my initial DataFrame with this pipeline

val indexedDf = pipeline.fit(data).transform(data)

I get a StackOverflowError:

16/07/05 16:55:12 INFO DAGScheduler: Job 4 finished: countByValue at StringIndexer.scala:86, took 7.882774 s
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.immutable.Set$Set1.contains(Set.scala:84)
at scala.collection.immutable.Set$Set1.$plus(Set.scala:86)
at scala.collection.immutable.Set$Set1.$plus(Set.scala:81)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:22)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:20)
at scala.collection.generic.Growable$class.loop$1(Growable.scala:53)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
at scala.collection.mutable.SetBuilder.$plus$plus$eq(SetBuilder.scala:20)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
at scala.collection.AbstractTraversable.to(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toSet(TraversableOnce.scala:304)
at scala.collection.AbstractTraversable.toSet(Traversable.scala:104)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild$lzycompute(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:280)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
...

What am I doing wrong? Thanks.

asked Jul 05 '16 by Andrew Tsibin

People also ask

What is spark stack overflow?

Apache Spark is an open source distributed data processing engine written in Scala, providing a unified API and distributed datasets to users for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

What is Java Lang StackOverflowError?

java.lang.StackOverflowError is a runtime error which points to serious problems that cannot be caught by an application. It indicates that the application stack is exhausted and is usually caused by deep or infinite recursion.
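
As a minimal illustration (hypothetical code, not from the question), a non-tail-recursive Scala function exhausts the stack in exactly this way:

// Each call adds a stack frame; since this is not a tail call,
// scalac cannot rewrite it into a loop.
def blowTheStack(n: Long): Long = 1 + blowTheStack(n + 1)

blowTheStack(0) // throws java.lang.StackOverflowError once the thread stack is full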

How do I cache my spark?

Caching methods in Spark:
- DISK_ONLY: persist data on disk only, in serialized format.
- MEMORY_ONLY: persist data in memory only, in deserialized format.
- MEMORY_AND_DISK: persist data in memory; if enough memory is not available, evicted blocks are stored on disk.
- OFF_HEAP: persist data in off-heap memory.
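
A minimal sketch of choosing a storage level explicitly (df stands in for any DataFrame; the level shown is just an example):

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK keeps blocks in memory and spills evicted ones to disk
// instead of recomputing them.
val cached = df.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()     // an action materializes the cache
cached.unpersist() // release it when it is no longer needed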


2 Answers

Most probably there is just not enough memory to keep all the stack frames. I experienced something similar when training a RandomForestModel. The workaround that works for me is to run my driver application (which is a web service) with additional parameters:

-XX:ThreadStackSize=81920 -Dspark.executor.extraJavaOptions='-XX:ThreadStackSize=81920'
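
For reference, a hedged sketch of how the executor-side option can also be set programmatically when the SparkContext is built (the stack size is the value from above; the driver-side -XX:ThreadStackSize still has to go on the JVM command line, since the driver JVM is already running at this point):

import org.apache.spark.{SparkConf, SparkContext}

// Executor JVMs are launched by Spark, so their extra Java options can come from the conf.
val conf = new SparkConf()
  .setAppName("string-indexing-job") // hypothetical app name
  .set("spark.executor.extraJavaOptions", "-XX:ThreadStackSize=81920")
val sc = new SparkContext(conf)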
answered by evgenii


It seems I found a solution of sorts: use Spark 2.0. Previously I used 1.6.2, which was the latest version at the time of the issue. I also tried the 2.0 preview version, but the problem was reproduced there as well.
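
If the project is built with sbt, the upgrade is essentially a dependency bump (a sketch; the %% artifacts assume the Scala 2.11 builds that Spark 2.0.0 is published for):

// build.sbt (sketch)
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.0.0",
  "org.apache.spark" %% "spark-mllib" % "2.0.0"
)

The SQLContext-based read in the question still compiles on 2.0, although SparkSession is the new entry point there.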

answered by Andrew Tsibin