I have a CSV file with about 5000 rows and 950 columns. First I load it into a DataFrame:
val data = sqlContext.read
.format(csvFormat)
.option("header", "true")
.option("inferSchema", "true")
.load(file)
.cache()
After that I find all the string columns:
val featuresToIndex = data.schema
.filter(_.dataType == StringType)
.map(field => field.name)
and want to index them. To do that, I create an indexer for each string column:
val stringIndexers = featuresToIndex.map(colName =>
new StringIndexer()
.setInputCol(colName)
.setOutputCol(colName + "Indexed"))
and create a pipeline:
val pipeline = new Pipeline().setStages(stringIndexers.toArray)
But when I try to transform my initial DataFrame with this pipeline
val indexedDf = pipeline.fit(data).transform(data)
I get a StackOverflowError:
16/07/05 16:55:12 INFO DAGScheduler: Job 4 finished: countByValue at StringIndexer.scala:86, took 7.882774 s
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.immutable.Set$Set1.contains(Set.scala:84)
at scala.collection.immutable.Set$Set1.$plus(Set.scala:86)
at scala.collection.immutable.Set$Set1.$plus(Set.scala:81)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:22)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:20)
at scala.collection.generic.Growable$class.loop$1(Growable.scala:53)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
at scala.collection.mutable.SetBuilder.$plus$plus$eq(SetBuilder.scala:20)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
at scala.collection.AbstractTraversable.to(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toSet(TraversableOnce.scala:304)
at scala.collection.AbstractTraversable.toSet(Traversable.scala:104)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild$lzycompute(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:280)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
...
What am I doing wrong? Thanks.
java.lang.StackOverflowError is a runtime error which points to serious problems that cannot be caught by an application. It indicates that the application stack is exhausted and is usually caused by deep or infinite recursion.
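For illustration only, here is a minimal Scala sketch (not from the original post) of how unbounded, non-tail recursion exhausts the thread stack:

object StackOverflowDemo {
  // Not a tail call (the "+ 1" happens after the recursive call returns),
  // so the compiler cannot rewrite it as a loop; every call pushes a new
  // frame until the JVM throws java.lang.StackOverflowError.
  def recurse(n: Long): Long = recurse(n + 1) + 1

  def main(args: Array[String]): Unit =
    recurse(0)
}

The error in the question comes from the same mechanism: the stack trace shows Catalyst's TreeNode methods recursing over a very deep query plan.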
Caching methods in Spark:
DISK_ONLY: persist data on disk only, in serialized format.
MEMORY_ONLY: persist data in memory only, in deserialized format.
MEMORY_AND_DISK: persist data in memory; if enough memory is not available, evicted blocks will be stored on disk.
OFF_HEAP: data is persisted in off-heap memory.
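As a sketch of how these levels are chosen in code (using the standard Spark persist API; the data variable is from the question):

import org.apache.spark.storage.StorageLevel

// persist() with an explicit level instead of the default cache();
// MEMORY_AND_DISK spills evicted blocks to disk instead of recomputing them.
val persisted = data.persist(StorageLevel.MEMORY_AND_DISK)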
Most probably there is just not enough memory to keep all the stack frames. I experienced something similar when training a RandomForestModel. The workaround that works for me is to run my driver application (it's a web service) with additional parameters:
-XX:ThreadStackSize=81920 -Dspark.executor.extraJavaOptions='-XX:ThreadStackSize=81920'
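If the application is launched through spark-submit rather than directly, the same options can be passed like this (the class and jar names below are placeholders):

spark-submit \
  --driver-java-options "-XX:ThreadStackSize=81920" \
  --conf spark.executor.extraJavaOptions=-XX:ThreadStackSize=81920 \
  --class com.example.Main \
  app.jar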
Seems like I found a solution: use Spark 2.0. Previously I used 1.6.2, which was the latest version at the time of the issue. I also tried the preview version of 2.0, but the problem was reproduced there as well.
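For completeness, a minimal sketch of the upgrade, assuming an sbt build (the version shown is the 2.0.0 release):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.0.0",
  "org.apache.spark" %% "spark-mllib" % "2.0.0"
)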