
How to know the deploy mode of a PySpark application?

I am trying to fix an out-of-memory issue, and I want to know whether I need to change the relevant settings in the default configuration file (spark-defaults.conf) in the Spark home folder, or whether I can set them in the code.

I saw the question "PySpark: java.lang.OutofMemoryError: Java heap space", and it says the answer depends on whether I'm running in client mode. I'm running Spark on a cluster and monitoring it using standalone.

But how do I figure out whether I'm running Spark in client mode?

makansij asked Jul 14 '16 21:07




1 Answer

If you are running an interactive shell, e.g. pyspark (CLI or via an IPython notebook), by default you are running in client mode. You can easily verify that you cannot run pyspark or any other interactive shell in cluster mode:

$ pyspark --master yarn --deploy-mode cluster
Python 2.7.11 (default, Mar 22 2016, 01:42:54)
[GCC Intel(R) C++ gcc 4.8 mode] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Error: Cluster deploy mode is not applicable to Spark shells.

$ spark-shell --master yarn --deploy-mode cluster
Error: Cluster deploy mode is not applicable to Spark shells.
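
For a session that is already running, the mode can also be read back from the configuration rather than inferred. The following is a minimal sketch, not part of the original answer: the helper name get_deploy_mode is hypothetical, and it assumes the mode is recorded under the spark.submit.deployMode property, with "client" as the fallback when the property is unset. It accepts anything with a .get(key, default) method, so a plain dict stands in for the real configuration here.

```python
def get_deploy_mode(conf):
    """Return the deploy mode recorded in a SparkConf-like object.

    `conf` only needs a .get(key, default) method, so either a live
    SparkConf or a plain dict can be passed in.
    """
    # Assumption: the mode lives under spark.submit.deployMode, and an
    # unset property means the spark-submit default of "client".
    return conf.get("spark.submit.deployMode", "client")


# A dict used as a stand-in so the sketch runs without a cluster.
shell_like_conf = {"spark.master": "yarn"}
submitted_conf = {"spark.master": "yarn", "spark.submit.deployMode": "cluster"}

print(get_deploy_mode(shell_like_conf))
print(get_deploy_mode(submitted_conf))
```

Inside a pyspark shell or a submitted script, where sc is the active SparkContext, the same helper would be called as get_deploy_mode(sc.getConf()).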

Examining the contents of the bin/pyspark file may be instructive, too - here is the final line (which is the actual executable):

$ pwd
/home/ctsats/spark-1.6.1-bin-hadoop2.6
$ cat bin/pyspark
[...]
exec "${SPARK_HOME}"/bin/spark-submit pyspark-shell-main --name "PySparkShell" "$@"

That is, pyspark is a thin wrapper script that hands off to spark-submit under the application name PySparkShell, which is the name to look for in the Spark History Server UI. Because it is launched this way, it obeys whatever arguments (or defaults) its spark-submit command receives.
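
The "arguments or defaults" rule can be illustrated with a toy resolver. This is a sketch under stated assumptions, not Spark's actual option parser: the function deploy_mode_from_args is hypothetical, it only understands the space-separated --deploy-mode VALUE form, and it ignores settings that could come from spark-defaults.conf or --conf.

```python
def deploy_mode_from_args(args):
    """Toy lookup of --deploy-mode in a spark-submit style argument list.

    Hypothetical helper for illustration only; it is not how Spark
    itself parses its command line.
    """
    # Walk over adjacent (flag, value) pairs and pick up the value that
    # follows --deploy-mode, if the flag is present.
    for flag, value in zip(args, args[1:]):
        if flag == "--deploy-mode":
            return value
    # No explicit flag: fall back to the default mode, client.
    return "client"


# Arguments as they would be forwarded by the "$@" in bin/pyspark.
print(deploy_mode_from_args(["--master", "yarn"]))
print(deploy_mode_from_args(["--master", "yarn", "--deploy-mode", "cluster"]))
```

The first call mirrors a plain pyspark --master yarn invocation, which ends up in client mode; the second mirrors the invocation that the shells reject.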

desertnaut answered Dec 24 '22 07:12
