
How can PySpark be called in debug mode?

Tags: python, intellij-idea, python-2.7, apache-spark, hadoop

I have IntelliJ IDEA set up with Apache Spark 1.4.

I want to be able to add debug points to my Spark Python scripts so that I can debug them easily.

I am currently running this bit of Python to initialise the Spark process:

```
import subprocess

# stderr has to be piped as well, otherwise proc.stderr is None below
proc = subprocess.Popen([SPARK_SUBMIT_PATH, scriptFile, inputFile], shell=SHELL_OUTPUT,
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)

if VERBOSE:
    print(proc.stdout.read())
    print(proc.stderr.read())
```

When spark-submit eventually calls myFirstSparkScript.py, the debug mode is not engaged and it executes as normal. Unfortunately, editing the Apache Spark source code and running a customised copy is not an acceptable solution.

Does anyone know if it is possible to have spark-submit call the Apache Spark script in debug mode? If so, how?

264

asked Jul 06 '15 11:07

Toby Leheup

1 Answer

As far as I understand your intentions, what you want is not directly possible given the Spark architecture. Even without the subprocess call, the only part of your program that is directly accessible on the driver is the SparkContext. You are effectively isolated from the rest by several layers of communication, including at least one JVM instance (in local mode). To illustrate that, let's use a diagram from the PySpark Internals documentation.

[Figure: diagram from the PySpark Internals documentation]

What is in the left box is the part that is accessible locally and could be used to attach a debugger. Since it is mostly limited to JVM calls, there is really nothing there that should be of interest to you, unless you're actually modifying PySpark itself.

What is on the right happens remotely and, depending on the cluster manager you use, is pretty much a black box from a user's perspective. Moreover, there are many situations where the Python code on the right does nothing more than call the JVM API.

That was the bad part. The good part is that most of the time there should be no need for remote debugging. Apart from access to objects like TaskContext, which can be easily mocked, every part of your code should be easily runnable and testable locally without using a Spark instance at all.
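As a minimal sketch of the mocking idea, assuming Python 3's `unittest.mock`: the function `tag_with_partition` and its `get_context` parameter are hypothetical names introduced for illustration. On a cluster you would pass `TaskContext.get` from `pyspark`; locally a mock is enough.

```python
from unittest import mock


def tag_with_partition(records, get_context):
    """Pair every record with the id of the partition that processes it.

    `get_context` is a hypothetical injection point: on a cluster it would be
    pyspark.TaskContext.get, in a local test it can return a mock.
    """
    partition_id = get_context().partitionId()
    return [(partition_id, record) for record in records]


# Stand-in for the object returned by TaskContext.get(); no Spark needed.
fake_context = mock.Mock()
fake_context.partitionId.return_value = 3

tagged = tag_with_partition(["a", "b"], lambda: fake_context)
print(tagged)
```

Passing the context getter in as an argument is only one possible design; patching `TaskContext.get` with `mock.patch` would likely work as well when PySpark is installed locally.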

Functions you pass to actions / transformations take standard and predictable Python objects and are expected to return standard Python objects as well. Just as importantly, they should be free of side effects.

So at the end of the day you have two parts of your program: a thin layer that can be accessed interactively and tested purely on inputs / outputs, and a "computational core" which doesn't require Spark for testing / debugging.
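The split could look roughly like the sketch below. The functions `parse_record` and `add` and the sample data are made up for illustration; the point is only that the "computational core" runs (and can be stepped through in any debugger) with plain Python objects, while the Spark layer shrinks to a single line.

```python
def parse_record(line):
    """Split a 'key,value' line into a (key, int(value)) pair."""
    key, value = line.split(",")
    return key.strip(), int(value)


def add(left, right):
    """Side-effect-free combiner, suitable for reduceByKey."""
    return left + right


# "Computational core": exercised with plain Python objects only.
sample_lines = ["a, 1", "b, 2", "a, 3"]
parsed = [parse_record(line) for line in sample_lines]

totals = {}
for key, value in parsed:
    totals[key] = add(totals[key], value) if key in totals else value

# Thin Spark layer (not run here), assuming an existing SparkContext `sc`:
#   sc.parallelize(sample_lines).map(parse_record).reduceByKey(add).collect()
```

Breakpoints set inside `parse_record` or `add` are hit normally when this runs in a local interpreter, which is usually all the debugging such functions need.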

Other options

That being said, you're not completely out of options here.

Local mode

(passively attach debugger to a running interpreter)

Both plain GDB and the PyCharm debugger can be attached to a running process. This can be done only once the PySpark daemon and / or worker processes have been started. In local mode you can force that by executing a dummy action, for example:

```
sc.parallelize([], n).count()
```

where n is the number of "cores" available in local mode (local[n]). An example step-by-step procedure on Unix-like systems:

Start PySpark shell:
```
$SPARK_HOME/bin/pyspark 
```

Use pgrep to check there is no daemon process running:

```
➜  spark-2.1.0-bin-hadoop2.7$ pgrep -f pyspark.daemon
➜  spark-2.1.0-bin-hadoop2.7$
```

The same thing can be checked in PyCharm by pressing alt+shift+a and choosing Attach to Local Process, or via Run -> Attach to Local Process. At this point you should see only the PySpark shell (and possibly some unrelated processes).

Execute a dummy action:

```
sc.parallelize([], 1).count()
```

Now you should see both the daemon and a worker (here only one):
```
➜  spark-2.1.0-bin-hadoop2.7$ pgrep -f pyspark.daemon
13990
14046
➜  spark-2.1.0-bin-hadoop2.7$
```

The process with the lower pid is the daemon; the one with the higher pid is a (possibly ephemeral) worker.

At this point you can attach a debugger to the process of interest:
- In PyCharm by choosing the process to connect.
- With plain GDB by calling:
```
gdb python <pid of running process>
```

The biggest disadvantage of this approach is that you have to find the right interpreter at the right moment.
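Picking the right pid can be scripted to some degree. The sketch below uses a hypothetical helper, `split_daemon_and_workers`, and applies the rule of thumb from the steps above: the lowest pid is treated as the daemon and the rest as workers. That is a heuristic rather than a guarantee, since pids can wrap around.

```python
def split_daemon_and_workers(pgrep_output):
    """Return (daemon_pid, worker_pids) from the text printed by pgrep.

    Assumes, as described above, that the lowest pid is the daemon and any
    higher pids are workers forked from it.
    """
    pids = sorted(int(token) for token in pgrep_output.split())
    if not pids:
        return None, []
    return pids[0], pids[1:]


# Sample text copied from the pgrep session shown earlier.
daemon_pid, worker_pids = split_daemon_and_workers("13990\n14046\n")
print("daemon:", daemon_pid, "workers:", worker_pids)
```

On a live system the input text would come from running `pgrep -f pyspark.daemon`. Workers may still exit between the lookup and the attach, so the timing problem does not go away.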

Distributed mode

(using an active component that connects to a debugger server)

With PyCharm

PyCharm provides a Python Debug Server which can be used with PySpark jobs.

First of all, you should add a configuration for the remote debugger:

Press alt+shift+a and choose Edit Configurations, or use Run -> Edit Configurations.

Click on Add new configuration (green plus) and choose Python Remote Debug.

Configure the host and port according to your own setup (make sure that the port can be reached from a remote machine).

Start the debug server with shift+F9. You should see the debugger console.

Make sure that pydevd is accessible on the worker nodes, either by installing it or by distributing the egg file.

pydevd uses an active component which has to be included in your code:

```
import pydevd
pydevd.settrace(<host name>, port=<port number>)
```

The tricky part is finding the right place to include it. Unless you are debugging batch operations (like functions passed to mapPartitions), it may require patching the PySpark source itself, for example pyspark.daemon.worker or RDD methods like RDD.mapPartitions. Let's say we are interested in debugging worker behavior. A possible patch could look like this:

```
diff --git a/python/pyspark/daemon.py b/python/pyspark/daemon.py
index 7f06d4288c..6cff353795 100644
--- a/python/pyspark/daemon.py
+++ b/python/pyspark/daemon.py
@@ -44,6 +44,9 @@ def worker(sock):
     """
     Called by a worker process after the fork().
     """
+    import pydevd
+    pydevd.settrace('foobar', port=9999, stdoutToServer=True, stderrToServer=True)
+
     signal.signal(SIGHUP, SIG_DFL)
     signal.signal(SIGCHLD, SIG_DFL)
     signal.signal(SIGTERM, SIG_DFL)
```

If you decide to patch the Spark source, be sure to use the patched source and not the packaged version located in $SPARK_HOME/python/lib.

Execute the PySpark code, then go back to the debugger console and have fun.
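For batch operations such as functions passed to mapPartitions, patching PySpark may be avoidable, because the `settrace` call can live in your own code. The following is only a sketch. The wrapper `with_remote_debugger`, its `settrace` parameter, and `double_all` are hypothetical names, and the host `'foobar'` and port `9999` are simply the placeholders from the patch above.

```python
def with_remote_debugger(partition_func, host, port, settrace=None):
    """Wrap a function meant for mapPartitions so it first calls settrace.

    `settrace` is a hypothetical injection point for tests; when it is None
    the real pydevd.settrace is imported lazily on the worker.
    """
    def wrapper(iterator):
        trace = settrace
        if trace is None:
            import pydevd  # must be importable on every worker node
            trace = pydevd.settrace
        # Connects this worker to the debug server started in PyCharm.
        trace(host, port=port, stdoutToServer=True, stderrToServer=True)
        return partition_func(iterator)
    return wrapper


def double_all(iterator):
    """Example partition function: doubles each element."""
    return (x * 2 for x in iterator)


# Local check with a fake settrace, so neither Spark nor pydevd is required.
calls = []
debuggable = with_remote_debugger(
    double_all, "foobar", 9999,
    settrace=lambda *args, **kwargs: calls.append((args, kwargs)))
result = list(debuggable(iter([1, 2, 3])))

# On a cluster (not run here), assuming an existing RDD `rdd`:
#   rdd.mapPartitions(with_remote_debugger(double_all, "foobar", 9999)).collect()
```

Importing pydevd inside the wrapper keeps the driver free of the dependency. The workers still need pydevd installed or shipped to them, as noted in the steps above.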

Other tools

There are a number of tools, including python-manhole and pyrasite, which can be used, with some effort, to work with PySpark.

Note:

Of course, you can use "remote" (active) methods with local mode and, to some extent, "local" methods with distributed mode (you can connect to the worker node and follow the same steps as in local mode).

181

answered Oct 18 '22 03:10

zero323
