
Google Cloud Dataproc configuration issues

I've been encountering various issues with some Spark LDA topic modeling I've been running (mainly disassociation errors at seemingly random intervals), which I think mainly come down to insufficient memory allocation on my executors. That in turn seems related to problematic automatic cluster configuration. My latest attempt uses n1-standard-8 machines (8 cores, 30GB RAM) for both the master and worker nodes (6 workers, so 48 cores total).
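For reference, the cluster itself was created with basically default settings, via something roughly like the following (cluster name and exact flags reproduced from memory, so treat this as a sketch; I'm not passing any explicit Spark or YARN memory properties):

gcloud dataproc clusters create cluster-3 \
    --master-machine-type n1-standard-8 \
    --worker-machine-type n1-standard-8 \
    --num-workers 6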

But when I look at /etc/spark/conf/spark-defaults.conf I see this:

spark.master yarn-client
spark.eventLog.enabled true
spark.eventLog.dir hdfs://cluster-3-m/user/spark/eventlog

# Dynamic allocation on YARN
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.initialExecutors 100000
spark.dynamicAllocation.maxExecutors 100000
spark.shuffle.service.enabled true
spark.scheduler.minRegisteredResourcesRatio 0.0

spark.yarn.historyServer.address cluster-3-m:18080
spark.history.fs.logDirectory hdfs://cluster-3-m/user/spark/eventlog

spark.executor.cores 4
spark.executor.memory 9310m
spark.yarn.executor.memoryOverhead 930

# Overkill
spark.yarn.am.memory 9310m
spark.yarn.am.memoryOverhead 930

spark.driver.memory 7556m
spark.driver.maxResultSize 3778m
spark.akka.frameSize 512

# Add ALPN for Bigtable
spark.driver.extraJavaOptions -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar
spark.executor.extraJavaOptions -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.3.v20150130.jar

But these values don't make much sense. Why use only 4 of the 8 cores per executor? And only 9.3 of the 30GB of RAM? My impression was that all of this configuration was supposed to be handled automatically, but even my attempts at manual tweaking aren't getting me anywhere.

For instance, I tried launching the shell with:

spark-shell --conf spark.executor.cores=8 --conf spark.executor.memory=24g

But then this failed with

java.lang.IllegalArgumentException: Required executor memory (24576+930 MB) is above the max threshold (22528 MB) of this cluster! Please increase the value of 'yarn.scheduler.maximum-allocation-mb'.

I tried changing the associated value in /etc/hadoop/conf/yarn-site.xml, to no effect. Even when I try a different cluster setup (e.g. using executors with 60+ GB RAM) I end up with the same problem. For some reason the max threshold remains at 22528MB.
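(For concreteness, the setting I was editing is something like the following; the value shown is just an example of what I tried bumping it to:)

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>28672</value>
</property>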

Is there something I'm doing wrong here, or is this a problem with Google's automatic configuration?

asked Dec 07 '15 by moustachio



1 Answer

There are some known issues with default memory configs in clusters where the master machine type is different from the worker machine type, though in your case that doesn't appear to be the main issue.

When you saw the following:

spark.executor.cores 4
spark.executor.memory 9310m

this actually means that each worker node will run 2 executors, and each executor will utilize 4 cores such that all 8 cores are indeed used up on each worker. This way, if we give the AppMaster half of one machine, the AppMaster can successfully be packed next to an executor.
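Working through the numbers from your spark-defaults.conf (just the arithmetic, using the values shown above):

9310m + 930m overhead = 10240m per executor container
2 executors x 10240m = 20480m per worker, which fits under the 22528m YARN has available on an n1-standard-8 node
AM container (9310m + 930m = 10240m) + 1 executor (10240m) = 20480m on the node hosting the AppMaster, which also fits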

The amount of memory given to NodeManagers needs to leave some overhead for the NodeManager daemon itself, and misc. other daemon services such as the DataNode, so ~80% is left for NodeManagers. Additionally, allocations must be a multiple of the minimum YARN allocation, so after flooring to the nearest allocation multiple, that's where the 22528MB comes from for n1-standard-8.

If you add workers that have 60+ GB of RAM, then as long as you also use a master node of the same memory size, you should see a higher max threshold number.

Either way, if you're seeing OOM issues, then it's not so much the memory per-executor that matters the most, but rather the memory per-task. And if you are increasing spark.executor.cores at the same time as spark.executor.memory, then the memory per-task isn't actually being increased, so you won't really be giving more headroom to your application logic in that case; Spark will use spark.executor.cores to determine the number of concurrent tasks to run in the same memory space.
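As a rough rule of thumb (ignoring Spark's internal storage/shuffle memory fractions), the memory available to each concurrent task is approximately:

memory per task ~ spark.executor.memory / spark.executor.cores

So with the defaults above that's about 9310m / 4 ~ 2.3GB per task, and scaling cores and memory up together keeps that ratio roughly where it was.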

To actually get more memory per task, you should mostly try:

  1. Use n1-highmem-* machine types
  2. Try reducing spark.executor.cores while leaving spark.executor.memory the same
  3. Try increasing spark.executor.memory while leaving spark.executor.cores the same

If you do (2) or (3) above then you'll indeed be leaving cores idle compared to the default config which tries to occupy all cores, but that's really the only way to get more memory per-task aside from going to highmem instances.
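To make that concrete, here's roughly what each option could look like with your current defaults (machine type, cluster name, and the exact memory figure are just illustrative; for option (3), executor memory plus the 930m overhead still has to stay under the 22528m threshold):

# Option 1: highmem machine types at cluster creation time
gcloud dataproc clusters create my-cluster \
    --master-machine-type n1-highmem-8 \
    --worker-machine-type n1-highmem-8 \
    --num-workers 6

# Option 2: halve the cores, keep the default 9310m -> ~4.6GB per task
spark-shell --conf spark.executor.cores=2

# Option 3: keep 4 cores, raise the memory -> ~5GB per task
spark-shell --conf spark.executor.memory=20g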

answered Oct 01 '22 by Dennis Huo