
How to deal with executor memory and driver memory in Spark?

I am confused about dealing with executor memory and driver memory in Spark.

My environment settings are as below:

  • 128 GB memory, 16 CPUs, 9 VMs
  • CentOS
  • Hadoop 2.5.0-cdh5.2.0
  • Spark 1.1.0

Input data information:

  • 3.5 GB data file from HDFS

For simple development, I executed my Python code in standalone cluster mode (8 workers, 20 cores, 45.3 G memory) with spark-submit. Now I would like to set executor memory or driver memory for performance tuning.
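Both settings can be passed straight to spark-submit on the command line. A minimal sketch (the master URL, memory sizes, and script name below are placeholders for your own values):

```shell
# Sketch only: the master URL, memory sizes, and script name are placeholders.
# --executor-memory sets spark.executor.memory (heap per executor process);
# --driver-memory sets spark.driver.memory (heap for the driver process).
spark-submit \
  --master spark://master:7077 \
  --executor-memory 4g \
  --driver-memory 2g \
  my_job.py
```

Equivalently, the same values can be set as `spark.executor.memory` and `spark.driver.memory` in `spark-defaults.conf` or via `--conf` flags.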

From the Spark documentation, the definition for executor memory is

Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).

How about driver memory?

asked Nov 28 '14 by wlsherica



1 Answer

The memory you need to assign to the driver depends on the job.

If the job is based purely on transformations and terminates in some distributed output action like rdd.saveAsTextFile, rdd.saveToCassandra, ..., then the memory needs of the driver will be very low. A few hundred MB will do. The driver is also responsible for delivering files and collecting metrics, but is not involved in data processing.

If the job requires the driver to participate in the computation, e.g. some ML algorithm that needs to materialize results and broadcast them for the next iteration, then your job becomes dependent on the amount of data passing through the driver. Operations like .collect, .take and .takeSample deliver data to the driver, and hence the driver needs enough memory to hold that data.

e.g. if you have an RDD of 3 GB in the cluster and call val myresultArray = rdd.collect, then you will need 3 GB of memory in the driver to hold that data, plus some extra room for the functions mentioned in the first paragraph.
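That back-of-the-envelope sizing can be sketched in plain Python. The helper names below (`parse_jvm_memory`, `fits_in_driver`) and the 40% headroom figure are hypothetical, not part of Spark; only the JVM memory-string format ("512m", "2g") comes from the documentation quoted above:

```python
def parse_jvm_memory(s):
    """Parse a JVM memory string like '512m' or '2g' into bytes."""
    units = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}
    s = s.strip().lower()
    if s[-1] in units:
        return int(float(s[:-1]) * units[s[-1]])
    return int(s)

def fits_in_driver(collected_bytes, driver_memory, headroom=0.4):
    """Rough check: can a .collect() of `collected_bytes` fit in the driver heap?

    `headroom` reserves a fraction of the heap for the driver's own work
    (task scheduling, broadcast blocks, metrics) mentioned above; 0.4 is an
    arbitrary illustrative value, not a Spark constant.
    """
    return collected_bytes <= parse_jvm_memory(driver_memory) * (1 - headroom)

# A 3 GB rdd.collect() does not fit in a 2g driver heap with headroom...
print(fits_in_driver(3 * 1024**3, "2g"))  # False
# ...but fits comfortably with --driver-memory 8g.
print(fits_in_driver(3 * 1024**3, "8g"))  # True
```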

answered Sep 18 '22 by maasg