Spark 1.4 increase maxResultSize memory

I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16 GB of memory, so memory should not be an issue since my file is only 300 MB. However, when I try to convert a Spark RDD to a pandas DataFrame using the toPandas() function, I receive the following error:

serialized results of 9 tasks (1096.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) 

I tried to fix this by changing the spark-config file, but I still get the same error. I've heard that this is a problem with Spark 1.4 and am wondering if you know how to solve it. Any help is much appreciated.
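For reference, a minimal sketch of the kind of code that hits this limit (the file path and names are placeholders, not my exact setup):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)  # sc is the existing SparkContext

    # load the ~300 MB file; the path and format here are just placeholders
    df = sqlContext.read.json("data/myfile.json")

    # toPandas() collects every partition back to the driver, so the total
    # serialized result has to fit under spark.driver.maxResultSize
    pdf = df.toPandas()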

asked Jun 25 '15 by ahajib
People also ask

How do I set PySpark driver memory?

You can tell the JVM to start with 9 GB of driver memory by setting it in SparkConf, or in your default properties file. You can also have Spark read its default settings from SPARK_CONF_DIR or $SPARK_HOME/conf, where driver-memory can be configured.
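A sketch of what those two routes might look like (the 9g figure comes from the answer above; the SparkConf route only takes effect if it is applied before the driver JVM starts, so the properties-file route is the more reliable one):

    from pyspark import SparkConf, SparkContext

    # Route 1: request 9g of driver memory via SparkConf; this only works if
    # the property is in place before the driver JVM is launched
    conf = SparkConf().set("spark.driver.memory", "9g")
    sc = SparkContext(conf=conf)

    # Route 2: put the same setting in $SPARK_HOME/conf/spark-defaults.conf
    # (or a file under SPARK_CONF_DIR):
    #
    #     spark.driver.memory  9g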

What is Spark executor memory?

An executor is a process launched for a Spark application on a worker node. Each executor's memory is the sum of YARN overhead memory and JVM heap memory, and the JVM heap in turn holds the RDD cache memory and shuffle memory.
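As a rough illustration of how those two pieces are sized on YARN (the numbers here are made up, not from the answer):

    from pyspark import SparkConf

    conf = (SparkConf()
        # JVM heap per executor: holds the RDD cache and shuffle memory
        .set("spark.executor.memory", "4g")
        # YARN overhead per executor, in MB (Spark 1.x property name)
        .set("spark.yarn.executor.memoryOverhead", "512"))

    # YARN is asked for roughly 4g + 512m per executor container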


1 Answer

You can set the spark.driver.maxResultSize parameter in the SparkConf object:

    from pyspark import SparkConf, SparkContext

    # In Jupyter you have to stop the current context first
    sc.stop()

    # Create new config
    conf = (SparkConf()
        .set("spark.driver.maxResultSize", "2g"))

    # Create new context
    sc = SparkContext(conf=conf)

You should probably create a new SQLContext as well:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
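With the limit raised and the contexts recreated, the original conversion should fit; for example (the DataFrame source is again a placeholder):

    # rebuild the DataFrame under the new context; the path is a placeholder
    df = sqlContext.read.json("data/myfile.json")

    # ~1.1 GB of serialized results now fits under the 2g maxResultSize
    pdf = df.toPandas()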
answered Sep 24 '22 by zero323