If I run a Spark program in spark-shell, is it possible for it to hog the entire Hadoop cluster for hours?
Usually there are settings such as num-executors and executor-cores, for example:
spark-shell --driver-memory 10G --executor-memory 15G --executor-cores 8
But if they are not specified and I just run "spark-shell", will it consume the entire cluster, or are there reasonable defaults?
spark.executor.instances controls the number of executors to be used. Its spark-submit option is --num-executors. If it is not set, the default is 2 (in YARN mode).
The consensus in most Spark tuning guides is that 5 cores per executor is the optimal number of cores for parallel processing.
Following that recommendation, for a cluster of 10 nodes with 16 cores each:
- Leave 1 core per node for the Hadoop/YARN daemons => cores available per node = 16 - 1 = 15.
- Total cores available in the cluster = 15 x 10 = 150.
- Number of executors = total cores / cores per executor = 150 / 5 = 30.
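Applied to that sizing, a launch command might look something like the line below. The --num-executors and --executor-cores values come from the calculation above; the memory figure is only an assumption for illustration and should be derived from each node's RAM, leaving headroom for YARN overhead:
spark-shell --num-executors 30 --executor-cores 5 --executor-memory 19G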
Spark architecture: the central coordinator is called the Spark Driver, and it communicates with all the Workers. Each Worker node hosts one or more Executors, which are responsible for running the Tasks.
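To make that concrete, here is a minimal sketch you could paste into a spark-shell session; the lines typed at the prompt run in the driver, while the work inside the RDD operations is split into tasks that run on the executors:
// Typed at the spark-shell prompt: this line runs in the driver.
val data = sc.parallelize(1 to 1000, 10)   // 10 partitions => 10 tasks
// The map and reduce work is shipped to the executors as tasks.
val total = data.map(_ * 2).reduce(_ + _)
// The reduced result is sent back to the driver and printed there.
println(total)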
The default values for most configuration properties can be found in the Spark Configuration documentation. For the configuration properties in your example, the defaults are:
- spark.driver.memory = 1g
- spark.executor.memory = 1g
- spark.executor.cores = 1 in YARN mode, all the available cores on the worker in standalone mode.
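If you want to check which values an interactive session actually picked up, you can inspect the configuration from inside spark-shell itself (a quick sketch, Scala at the spark-shell prompt):
// Returns None when the property was not set explicitly, i.e. the default applies.
sc.getConf.getOption("spark.executor.memory")
sc.getConf.getOption("spark.executor.cores")
// Dump everything that was set explicitly (command line, spark-defaults.conf, etc.).
sc.getConf.getAll.foreach(println)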
Additionally, you can override these defaults by creating the file $SPARK_HOME/conf/spark-defaults.conf with the properties you want (as described in the Spark configuration documentation). Then, if the file exists with the desired values, you don't need to pass them as arguments to the spark-shell command.
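As an illustration, a spark-defaults.conf matching the values from the question might look like this (the spark.executor.instances line is an assumed cap added for the example, not something from the question):
# $SPARK_HOME/conf/spark-defaults.conf
spark.driver.memory       10g
spark.executor.memory     15g
spark.executor.cores      8
spark.executor.instances  4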