Context: The cluster configuration is as follows:
Problem: I'm trying to create a spark session in pyspark jupyter notebook which runs in cluster deploy mode. I'm trying to get the driver to run on a node that isn't the node where the jupyter notebook is run from. Right now I can run jobs on the cluster but only with the driver running on node2.
After a lot of digging I found this stackoverflow post which claims that if you run an interactive shell with spark you can only do so in local deploy mode (where the driver is located on the machine you are working on). That post goes on to say that something like jupyter hub as result also won't work in cluster deploy mode but I am unable to find any documentation that can confirm this. Can someone confirm if jupyter hub can run in cluster mode at all?
My attempt at running the spark session in cluster deployed mode:
from pyspark.sql import SparkSession
spark = SparkSession.builder\
.enableHiveSupport()\
.config("spark.local.ip",<node 3 ip>)\
.config("spark.driver.host",<node 3 ip>)\
.config('spark.submit.deployMode','cluster')\
.getOrCreate()
Error:
/usr/spark/python/pyspark/sql/session.py in getOrCreate(self)
167 for key, value in self._options.items():
168 sparkConf.set(key, value)
--> 169 sc = SparkContext.getOrCreate(sparkConf)
170 # This SparkContext may be an existing one.
171 for key, value in self._options.items():
/usr/spark/python/pyspark/context.py in getOrCreate(cls, conf)
308 with SparkContext._lock:
309 if SparkContext._active_spark_context is None:
--> 310 SparkContext(conf=conf or SparkConf())
311 return SparkContext._active_spark_context
312
/usr/spark/python/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
113 """
114 self._callsite = first_spark_call() or CallSite(None, None, None)
--> 115 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
116 try:
117 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
/usr/spark/python/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
257 with SparkContext._lock:
258 if not SparkContext._gateway:
--> 259 SparkContext._gateway = gateway or launch_gateway(conf)
260 SparkContext._jvm = SparkContext._gateway.jvm
261
/usr/spark/python/pyspark/java_gateway.py in launch_gateway(conf)
93 callback_socket.close()
94 if gateway_port is None:
---> 95 raise Exception("Java gateway process exited before sending the driver its port number")
96
97 # In Windows, ensure the Java child processes do not linger after Python has exited.
Exception: Java gateway process exited before sending the driver its port number
You cannot use cluster mode with PySpark at all:
Currently, standalone mode does not support cluster mode for Python applications.
and even if you could cluster mode is not applicable in interactive environment:
case (_, CLUSTER) if isShell(args.primaryResource) =>
error("Cluster deploy mode is not applicable to Spark shells.")
case (_, CLUSTER) if isSqlShell(args.mainClass) =>
error("Cluster deploy mode is not applicable to Spark SQL shell.")
Jupyter
is through
Apache Livy Basically, Livy
is a REST API service for Spark
cluster.
Jupyter
has a extension "spark-magic" that allows to integrate Livy
with Jupyter
An example of Jupyter
with Spark-magic
bound (driver runs in the yarn cluster and not locally in this case, as mentioned above):
Another way to use YARN-cluster
mode in Jupyter is to use Jupyter Enterprise Gateway
https://jupyter-enterprise-gateway.readthedocs.io/en/latest/kernel-yarn-cluster-mode.html#configuring-kernels-for-yarn-cluster-mode
There are also commercial options that normally use one of the methods I listed above. Some of our users for example are on IBM DSX (aka IBM Watson Studio Local) use Apache Livy
- option number one above.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With