I want to step through PySpark code while still using YARN. The way I currently do it is to start the pyspark shell, copy-paste the code, and execute it line by line. I wonder whether there is a better way.
pdb.set_trace() would be a much more efficient option if it worked. I tried it with spark-submit --master yarn --deploy-mode client. The program did stop and give me a shell at the line where pdb.set_trace() was called. However, any pdb commands entered in that shell simply hung. The pdb.set_trace() call was inserted between Spark function calls which, as I understand, should be executed in the driver, which runs locally with a terminal attached.
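For concreteness, here is a minimal sketch of the kind of driver script I mean (the file name, data, and column names are made up for illustration); it is submitted with spark-submit --master yarn --deploy-mode client:

import pdb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pdb-example").getOrCreate()
df = spark.range(100).withColumnRenamed("id", "value")

pdb.set_trace()  # breakpoint between driver-side Spark calls; pdb commands hang here under yarn-client

result = df.filter(df.value % 2 == 0).count()
print(result)
spark.stop()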
I read the post How can pyspark be called in debug mode?, which seems to suggest that using pdb is impossible without relying on an IDE (PyCharm). However, if running Spark code interactively is possible, there should be a way to tell PySpark "run all the way to this line and give me a shell for a REPL (interactive use)". I haven't found any way to do this. Any suggestions/references are appreciated.
I also ran into pdb hanging. I found pdb_clone, and it works like a charm.
First, install pdb_clone
> pip install pdb_clone
Then, include these lines where you want to debug.
from pdb_clone import pdb
pdb.set_trace_remote()
When your program reaches that line, run the pdb-attach command in another terminal.
> pdb-attach
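For example, a minimal driver script using this approach might look like the following (the script name and data are made up for illustration; you attach from a second terminal on the same machine):

from pdb_clone import pdb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pdb-clone-example").getOrCreate()
df = spark.range(100)

pdb.set_trace_remote()  # execution pauses here; attach from another terminal with: pdb-attach

print(df.filter(df.id % 2 == 0).count())
spark.stop()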
Check out the tool pyspark_xray, which enables you to step into 100% of your PySpark code using PyCharm. Below is a high-level summary extracted from its documentation.
pyspark_xray is a diagnostic tool, in the form of a Python library, that lets PySpark developers debug and troubleshoot PySpark applications locally. Specifically, it enables local debugging of PySpark RDD or DataFrame transformation functions that run on slave nodes.
The purpose of pyspark_xray is to create a development framework that enables PySpark application developers to debug and troubleshoot locally and do production runs remotely using the same code base. For local debugging, pyspark_xray specifically provides the capability to debug Spark application code that runs on slave nodes; the lack of this capability is currently an unfilled gap for Spark application developers.
For developers, it's very important to do step-by-step debugging of every part of an application locally in order to diagnose, troubleshoot and solve problems during development.
If you develop PySpark applications, you know that PySpark application code is made up of two categories:
- code that runs on the master node (the driver)
- code that runs on slave nodes (for example, functions passed to RDD or DataFrame transformations)

While code on the master node can be accessed locally by a debugger, code on slave nodes is like a black box and is not locally accessible to a debugger.
Plenty of tutorials on the web cover the steps for debugging PySpark code that runs on the master node, but when it comes to debugging PySpark code that runs on slave nodes, no solution can be found; most people treat this part of the code either as a black box or as something that doesn't need debugging.
Spark code that runs on slave nodes includes, but is not limited to, lambda functions that are passed as input parameters to RDD transformation functions.
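For instance, in a snippet like the one below (made up for illustration), a plain local debugger can step through the driver-side calls but not into the lambda, because the lambda executes on the worker (slave) nodes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lambda-example").getOrCreate()
rdd = spark.sparkContext.parallelize(range(10))

# Driver-side code: a local debugger can stop here...
doubled = rdd.map(lambda x: x * 2)  # ...but the lambda itself executes on slave nodes
print(doubled.collect())
spark.stop()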
The pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code, not only code that runs on the master node but also code that runs on slave nodes, using PyCharm and other popular IDEs such as VSCode.
This library achieves these capabilities using techniques such as a local-mode flag: CONST_BOOL_LOCAL_MODE from pyspark_xray's const.py auto-detects whether local mode is on or off based on the current OS. With this flag in your Spark code base, you can locally debug and remotely execute your Spark application using the same code base.
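A rough sketch of the idea follows; this is not pyspark_xray's actual implementation, and the OS-detection logic and session setup here are assumptions for illustration only:

import platform
from pyspark.sql import SparkSession

# Hypothetical stand-in for pyspark_xray's CONST_BOOL_LOCAL_MODE:
# assume local mode when developing on Windows/macOS, cluster mode otherwise.
CONST_BOOL_LOCAL_MODE = platform.system() in ("Windows", "Darwin")

if CONST_BOOL_LOCAL_MODE:
    spark = SparkSession.builder.master("local[*]").appName("dev").getOrCreate()
else:
    spark = SparkSession.builder.appName("prod").getOrCreate()  # master supplied by spark-submit/YARN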