
Incorrect memory allocation for Yarn/Spark after automatic setup of Dataproc Cluster

I'm trying to run Spark jobs on a Dataproc cluster, but Spark will not start due to YARN being misconfigured.

I receive the following error when running spark-shell locally on the master, as well as when submitting a job through the web GUI or the gcloud command-line utility from my local machine:

15/11/08 21:27:16 ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (38281+2679 MB) is above the max threshold (20480 MB) of this cluster! Please increase the value of 'yarn.scheduler.maximum-allocation-mb'.

I tried modifying the value in /etc/hadoop/conf/yarn-site.xml but it didn't change anything. I don't think it pulls the configuration from that file.
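
For reference, this is the property in question in yarn-site.xml; the value shown below is just an illustrative placeholder, not what Dataproc generated:

    <!-- /etc/hadoop/conf/yarn-site.xml (illustrative value only) -->
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>45056</value>
    </property>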

I've tried multiple cluster configurations, in several zones (mainly in Europe), and I only got this to work with the low-memory version (4 cores, 15 GB memory).

That is, this is only a problem on nodes configured with more memory than the YARN default allows.

asked Nov 08 '15 by habitats


People also ask

How to run Spark in YARN mode?

Launching Spark on YARN: ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory containing the (client-side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager.
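
A minimal sketch of what that looks like on a typical Hadoop node; the /etc/hadoop/conf path is an assumption, and on Dataproc the defaults already point Spark at YARN, so this is mostly illustrative:

    # Point Spark at the cluster's client-side Hadoop configuration (path is an assumption)
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    # Launch an interactive shell with YARN as the resource manager
    spark-shell --master yarn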

What is static allocation and dynamic allocation in Spark?

There are two ways to configure executor and core settings for a Spark job. They are: Static allocation – the values are given as part of spark-submit. Dynamic allocation – the values are picked up based on the requirement (size of data, amount of computation needed) and released after use.
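
A rough sketch of the two styles; job.py and the resource numbers are placeholders, and dynamic allocation also requires the external shuffle service to be enabled on the cluster:

    # Static allocation: fix the executor footprint at submit time
    spark-submit --conf spark.executor.memory=4g \
      --conf spark.executor.cores=2 \
      --conf spark.executor.instances=4 job.py

    # Dynamic allocation: let Spark scale executors up and down as needed
    spark-submit --conf spark.dynamicAllocation.enabled=true \
      --conf spark.shuffle.service.enabled=true job.py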


1 Answer

Sorry about these issues you're running into! It looks like this is part of a known issue where certain memory settings end up computed based on the master machine's size rather than the worker machines' size, and we're hoping to fix this in an upcoming release.

There are two current workarounds:

  1. Use a master machine type with memory equal to or smaller than the worker machine type (see the cluster-creation sketch after this list).
  2. Explicitly set spark.executor.memory and spark.executor.cores, either using the --conf flag if running from an SSH connection, like:

    spark-shell --conf spark.executor.memory=4g --conf spark.executor.cores=2
    

    or if running gcloud beta dataproc, use --properties:

    gcloud beta dataproc jobs submit spark --properties spark.executor.memory=4g,spark.executor.cores=2
    

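For workaround 1, the machine types can be chosen explicitly at cluster creation time. A rough sketch with a hypothetical cluster name (zone and other flags omitted):

    gcloud beta dataproc clusters create my-cluster \
      --master-machine-type n1-standard-8 \
      --worker-machine-type n1-standard-8
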
You can adjust the number of cores and the amount of memory per executor as necessary; it's fine to err on the side of smaller executors and let YARN pack lots of executors onto each worker. That said, you can save some per-executor overhead by setting spark.executor.memory to the full size available in each YARN container and spark.executor.cores to all the cores in each worker.
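
As an illustration of that "one large executor per worker" sizing, the numbers below are assumptions for a worker exposing roughly 20 GB to YARN; spark.executor.memory must leave room for the off-heap overhead YARN adds on top of it:

    # Hypothetical sizing for a worker whose YARN container limit is ~20480 MB:
    # keep executor memory plus overhead (~10%, min 384 MB) under that limit
    spark-shell --conf spark.executor.memory=18g --conf spark.executor.cores=4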

EDIT: As of January 27th, new Dataproc clusters are now configured correctly for any combination of master/worker machine types, as mentioned in the release notes.

answered by Dennis Huo