I have a CSV file with about 5000 rows and 950 columns. First I load it into a DataFrame:
val data = sqlContext.read
.format(csvFormat)
.option("header", "true")
.option("inferSchema", "true")
.load(file)
.cache()
After that I find all the string columns:
val featuresToIndex = data.schema
.filter(_.dataType == StringType)
.map(field => field.name)
and want to index them. To do that, I create an indexer for each string column:
val stringIndexers = featuresToIndex.map(colName =>
new StringIndexer()
.setInputCol(colName)
.setOutputCol(colName + "Indexed"))
and create a pipeline:
val pipeline = new Pipeline().setStages(stringIndexers.toArray)
But when I try to transform my initial DataFrame with this pipeline
val indexedDf = pipeline.fit(data).transform(data)
I get a StackOverflowError:
16/07/05 16:55:12 INFO DAGScheduler: Job 4 finished: countByValue at StringIndexer.scala:86, took 7.882774 s
Exception in thread "main" java.lang.StackOverflowError
at scala.collection.immutable.Set$Set1.contains(Set.scala:84)
at scala.collection.immutable.Set$Set1.$plus(Set.scala:86)
at scala.collection.immutable.Set$Set1.$plus(Set.scala:81)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:22)
at scala.collection.mutable.SetBuilder.$plus$eq(SetBuilder.scala:20)
at scala.collection.generic.Growable$class.loop$1(Growable.scala:53)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:57)
at scala.collection.mutable.SetBuilder.$plus$plus$eq(SetBuilder.scala:20)
at scala.collection.TraversableLike$class.to(TraversableLike.scala:590)
at scala.collection.AbstractTraversable.to(Traversable.scala:104)
at scala.collection.TraversableOnce$class.toSet(TraversableOnce.scala:304)
at scala.collection.AbstractTraversable.toSet(Traversable.scala:104)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild$lzycompute(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:280)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
...
What am I doing wrong? Thanks.
java.lang.StackOverflowError is a runtime error which points to serious problems that cannot be caught by an application. It indicates that the application stack is exhausted and is usually caused by deep or infinite recursion.
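For illustration only, here is a minimal Scala sketch (not from the original post) of how unbounded, non-tail recursion exhausts the thread stack:

object StackOverflowDemo {
  // Not a tail call (the "+ 1" happens after the recursive call returns),
  // so the compiler cannot rewrite it as a loop; every call pushes a new
  // frame until the JVM throws java.lang.StackOverflowError.
  def recurse(n: Long): Long = recurse(n + 1) + 1

  def main(args: Array[String]): Unit =
    recurse(0)
}

The error in the question comes from the same mechanism: the stack trace shows Catalyst's TreeNode methods recursing over a very deep query plan.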
Caching methods in Spark:
DISK_ONLY: persist data on disk only, in serialized format.
MEMORY_ONLY: persist data in memory only, in deserialized format.
MEMORY_AND_DISK: persist data in memory; if enough memory is not available, evicted blocks will be stored on disk.
OFF_HEAP: data is persisted in off-heap memory.
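As a sketch of how these levels are chosen in code (using the standard Spark persist API; the data variable is from the question):

import org.apache.spark.storage.StorageLevel

// persist() with an explicit level instead of the default cache();
// MEMORY_AND_DISK spills evicted blocks to disk instead of recomputing them.
val persisted = data.persist(StorageLevel.MEMORY_AND_DISK)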
Most probably there is just not enough memory to keep all the stack frames. I experienced something similar when training a RandomForestModel. The workaround that works for me is to run my driver application (it's a web service) with additional parameters:
-XX:ThreadStackSize=81920 -Dspark.executor.extraJavaOptions='-XX:ThreadStackSize=81920'
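If the application is launched through spark-submit rather than directly, the same options can be passed like this (the class and jar names below are placeholders):

spark-submit \
  --driver-java-options "-XX:ThreadStackSize=81920" \
  --conf spark.executor.extraJavaOptions=-XX:ThreadStackSize=81920 \
  --class com.example.Main \
  app.jar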
Seems like I found a solution: use Spark 2.0. Previously I used 1.6.2, which was the latest version at the time of the issue. I also tried the preview version of 2.0, but the problem was reproduced there as well.
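For completeness, a minimal sketch of the upgrade, assuming an sbt build (the version shown is the 2.0.0 release):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.0.0",
  "org.apache.spark" %% "spark-mllib" % "2.0.0"
)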