I have a dataset of about 10 GB (for example, Test.txt).
I wrote my PySpark script as below (Test.py):
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
spark = SparkSession.builder.appName("FilterProduct").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
lines = spark.read.text("C:/Users/test/Desktop/Test.txt").rdd
lines.collect()
Then I execute the above script with the command below:
spark-submit Test.py --executor-memory 12G
Then I get the error below:
17/12/29 13:27:18 INFO FileScanRDD: Reading File path: file:///C:/Users/test/Desktop/Test.txt, range: 402653184-536870912, partition values: [empty row]
17/12/29 13:27:18 INFO CodeGenerator: Code generated in 22.743725 ms
17/12/29 13:27:44 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3230)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:383)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/12/29 13:27:44 ERROR Executor: Exception in task 2.0 in stage 0.0 (TID 2)
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3230)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93
Please let me know how to resolve this.
You can resolve it by adjusting the partitioning: increase the value of spark.sql.shuffle.partitions.
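As a minimal sketch of how that setting could be raised (the value 400 below is only illustrative, not a figure from this answer; the Spark default is 200):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("FilterProduct")
         .config("spark.sql.shuffle.partitions", "400")  # more, smaller shuffle partitions
         .getOrCreate())

# the same setting can also be changed at runtime on an existing session
spark.conf.set("spark.sql.shuffle.partitions", "400")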
To enlarge the Spark shuffle service memory, modify SPARK_DAEMON_MEMORY in $SPARK_HOME/conf/spark-env.sh (the default value is 2g), then restart the shuffle service for the change to take effect.
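For illustration, the corresponding line in spark-env.sh might look like the following (the 4g figure is only an example value, not a recommendation from this answer):

# $SPARK_HOME/conf/spark-env.sh
export SPARK_DAEMON_MEMORY=4g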
Off-heap memory is outside the ambit of garbage collection, so it gives the application developer more fine-grained control over memory. Spark uses off-heap memory for two purposes: part of it is used by the JVM internally for things like String interning and JVM overheads, and part can be used by Spark itself for execution and storage when off-heap memory is enabled.
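If you want to experiment with Spark-managed off-heap memory, a minimal sketch (assuming Spark 2.x or later, where the spark.memory.offHeap.enabled and spark.memory.offHeap.size settings exist; the 2g size is illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("FilterProduct")
         .config("spark.memory.offHeap.enabled", "true")  # allow execution/storage memory off-heap
         .config("spark.memory.offHeap.size", "2g")       # required when off-heap is enabled
         .getOrCreate())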
In your apache-spark directory, check that you have the file apache-spark/2.4.0/libexec/conf/spark-defaults.conf, where 2.4.0 corresponds to your apache-spark version. If this file does not exist, create it. Then insert at the end of the file:

spark.driver.memory 12g

This should solve the problem without needing --executor-memory 12G: just run spark-submit Test.py.
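Equivalently, a sketch of passing the driver memory on the command line with the standard spark-submit option (note that spark-submit options must come before the application file, otherwise they are treated as arguments to your script):

spark-submit --driver-memory 12g Test.py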
You could try --conf "spark.driver.maxResultSize=20g". You should check the available settings on the Spark configuration page: spark.apache.org/docs/latest/configuration.html.
In addition to this, I would suggest reducing the size of your task results; otherwise you could have trouble with serialization.
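To illustrate both points, a hedged sketch (spark.driver.maxResultSize is the setting named above; the take(10) call is just one way to avoid pulling the whole 10 GB back to the driver and is not part of the original answer):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("FilterProduct")
         .config("spark.driver.maxResultSize", "20g")  # cap on total serialized results sent to the driver
         .getOrCreate())

lines = spark.read.text("C:/Users/test/Desktop/Test.txt").rdd

# bring back only a small sample instead of collecting the entire dataset
sample = lines.take(10)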