I am initializing PySpark from within a Jupyter Notebook as follows:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf().setAppName("PySpark-testing-app").setMaster("yarn")
conf = (conf.set("deploy-mode","client")
.set("spark.driver.memory","20g")
.set("spark.executor.memory","20g")
.set("spark.driver.cores","4")
.set("spark.num.executors","6")
.set("spark.executor.cores","4"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext.getOrCreate(sc)
However, when I open the YARN GUI and look at "RUNNING Applications", I see that my session has been allocated only 1 container, 1 vCPU, and 1 GB of RAM, i.e. the default values! How can I get the desired allocation by passing the values listed above?
The cores property controls the number of concurrent tasks an executor can run. --executor-cores 5 means that each executor can run a maximum of five tasks at the same time.
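As a side note, the executor-side settings can generally still be applied through SparkConf in client mode; the executor count, however, is controlled by spark.executor.instances rather than the spark.num.executors key used in the question. A minimal sketch (the resource values are only placeholders):

from pyspark import SparkConf, SparkContext

# Executor-side settings can be applied programmatically before the
# SparkContext is created; only the driver settings hit the
# client-mode limitation explained below.
conf = (SparkConf()
        .setAppName("PySpark-testing-app")
        .setMaster("yarn")
        .set("spark.executor.memory", "20g")
        .set("spark.executor.cores", "4")        # max concurrent tasks per executor
        .set("spark.executor.instances", "6"))   # note: not "spark.num.executors"

sc = SparkContext(conf=conf)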
Jupyter Notebook launches PySpark in yarn-client mode, so the driver memory and some other configs cannot be set through the SparkConf class; you must set them on the command line.
Take a look at the official docs' explanation of the memory setting:
Note: In client mode, this config must not be set through the SparkConf
directly in your application, because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command
line option or in your default properties file.
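You can confirm whether a value actually took effect by asking the running context for its effective configuration, for example:

# Prints the effective value the driver is actually using
# (falls back to the 1g default if nothing was applied).
print(sc.getConf().get("spark.driver.memory", "1g"))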
There is another way to make this work:
import os
# PYSPARK_SUBMIT_ARGS must be set before the SparkContext (and its driver JVM) is created.
memory = '20g'
pyspark_submit_args = ' --driver-memory ' + memory + ' pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
Other configs that have to be fixed at launch time should be passed in the same way, as in the sketch below.
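For example, a sketch that pushes the values from the question through PYSPARK_SUBMIT_ARGS (the resource numbers are simply the ones from the question; --driver-cores is omitted because it only applies in cluster mode):

import os

# Must be assembled before the SparkContext (and hence the driver JVM) starts.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--master yarn "
    "--driver-memory 20g "
    "--executor-memory 20g "
    "--executor-cores 4 "
    "--num-executors 6 "
    "pyspark-shell"
)

from pyspark import SparkContext
sc = SparkContext()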