Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark raises OutOfMemoryError

When I run my spark python code as below:

import pyspark
conf = (pyspark.SparkConf()
     .setMaster("local")
     .setAppName("My app")
     .set("spark.executor.memory", "512m"))
sc = pyspark.SparkContext(conf = conf)        #start the conf
data =sc.textFile('/Users/tsangbosco/Downloads/transactions')
data = data.flatMap(lambda x:x.split()).take(all)

The file is about 20G and my computer have 8G ram, when I run the program in standalone mode, it raises the OutOfMemoryError:

Exception in thread "Local computation of job 12" java.lang.OutOfMemoryError: Java heap space
    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:119)
    at org.apache.spark.api.python.PythonRDD$$anon$1.next(PythonRDD.scala:112)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.api.python.PythonRDD$$anon$1.foreach(PythonRDD.scala:112)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at org.apache.spark.api.python.PythonRDD$$anon$1.to(PythonRDD.scala:112)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at org.apache.spark.api.python.PythonRDD$$anon$1.toBuffer(PythonRDD.scala:112)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at org.apache.spark.api.python.PythonRDD$$anon$1.toArray(PythonRDD.scala:112)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$1.apply(JavaRDDLike.scala:259)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:884)
    at org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:681)
    at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:666)

Is spark unable to deal with file larger than my ram? Could you tell me how to fix it?

like image 588
BoscoTsang Avatar asked Jun 19 '26 07:06

BoscoTsang


1 Answers

Spark can handle some case. But you are using take to force Spark to fetch all of the data to an array(in memory). In such case, you should store them to files, like using saveAsTextFile.

If you are interested in looking at some of data, you can use sample or takeSample.

like image 145
zsxwing Avatar answered Jun 20 '26 20:06

zsxwing



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!