best practice for debugging python-spark code

I want to step through python-spark code while still using yarn. The way I currently do it is to start the pyspark shell, copy-paste the code, and execute it line by line. I wonder whether there is a better way.

pdb.set_trace() would be a much more efficient option if it worked. I tried it with spark-submit --master yarn --deploy-mode client. The program did stop and give me a shell at the line where pdb.set_trace() was called. However, any pdb commands entered in the shell simply hung. The pdb.set_trace() was inserted between Spark function calls which, as I understand it, are executed in the driver, which runs locally with a terminal attached. I read this post How can pyspark be called in debug mode? which seems to suggest that using pdb is impossible without relying on an IDE (PyCharm). However, if interactively running Spark code is possible, there should be a way to tell python-spark "run all the way to this line and give me a shell for a REPL (interactive use)". I haven't found any way to do this. Any suggestions/references are appreciated.
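For reference, a minimal sketch of the kind of driver script I mean (the script, input path, and column names are made up for illustration):

import pdb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("debug-example").getOrCreate()

df = spark.read.json("hdfs:///some/input/path")   # placeholder input path
df = df.filter(df["value"] > 0)

pdb.set_trace()   # driver-side breakpoint between Spark calls; pdb commands hang here under yarn

result = df.groupBy("key").count().collect()
print(result)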

asked Mar 13 '18 by sgu

2 Answers

I also ran into pdb hanging. I found pdb_clone, and it works like a charm.

First, install pdb_clone

> pip install pdb_clone

Then, include these lines where you want to debug.

from pdb_clone import pdb
pdb.set_trace_remote()

When your program reaches that line, run the pdb-attach command in another terminal.

> pdb-attach
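For concreteness, here is a minimal sketch of how these lines might sit in a driver script (the app name and data below are made up for illustration, not taken from pdb_clone's docs):

from pyspark import SparkContext
from pdb_clone import pdb

sc = SparkContext(appName="pdb-clone-example")   # hypothetical app name

rdd = sc.parallelize(range(100)).map(lambda x: x * 2)

pdb.set_trace_remote()   # execution pauses here until you attach

print(rdd.sum())

While the program is paused on that line, running pdb-attach in another terminal drops you into a pdb session attached to the driver process.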
answered Oct 15 '22 by calee


Check out this tool called pyspark_xray, which enables you to step into 100% of your PySpark code using PyCharm. Below is a high-level summary extracted from its documentation.

pyspark_xray is a diagnostic tool, in the form of a Python library, for PySpark developers to debug and troubleshoot PySpark applications locally. Specifically, it enables local debugging of PySpark RDD or DataFrame transformation functions that run on slave nodes.

The purpose of developing pyspark_xray is to create a development framework that enables PySpark application developers to debug and troubleshoot locally and do production runs remotely using the same code base. For the local-debugging part, pyspark_xray specifically provides the capability to locally debug Spark application code that runs on slave nodes; the absence of this capability is an unfilled gap for Spark application developers right now.

Problem

For developers, it's very important to do step-by-step debugging of every part of an application locally in order to diagnose, troubleshoot and solve problems during development.

If you develop PySpark applications, you know that PySpark application code is made up of two categories:

  • code that runs on the master node
  • code that runs on worker/slave nodes

While code on the master node can be accessed by a debugger locally, code on slave nodes is like a black box and not accessible to a local debugger.

Plenty of tutorials on the web have covered the steps for debugging PySpark code that runs on the master node, but when it comes to debugging PySpark code that runs on slave nodes, no solution can be found; most people treat this part of the code either as a black box or as something that does not need debugging.

Spark code that runs on slave nodes includes, but is not limited to, lambda functions that are passed as input parameters to RDD transformation functions, as in the sketch below.
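As an illustration (hypothetical code, not from the pyspark_xray docs), the lambdas below are serialized and executed on the worker/slave nodes, so a debugger attached to the driver cannot step into them:

from pyspark import SparkContext

sc = SparkContext(appName="slave-code-example")   # hypothetical app name
rdd = sc.parallelize([1, 2, 3, 4])

# These lambdas run on the slave nodes, not on the driver,
# so a local breakpoint inside them never triggers.
doubled = rdd.map(lambda x: x * 2)
total = doubled.reduce(lambda a, b: a + b)

print(total)   # 20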

Solution

The pyspark_xray library enables developers to locally debug (step into) 100% of Spark application code, not only code that runs on the master node but also code that runs on slave nodes, using PyCharm and other popular IDEs such as VSCode.

This library achieves these capabilities by using the following techniques:

  • wrapper functions for Spark code on slave nodes; check out that section of the documentation to learn more details
  • the practice of sampling input data under local debugging mode in order to fit the application into the memory of your standalone local PC/Mac (sketched after this list)
    • For example, say your production input data has 1 million rows, which obviously cannot fit into one standalone PC/Mac's memory; in order to use pyspark_xray, you may take 100 sample rows as input to debug your application locally
  • usage of a flag to auto-detect local mode: CONST_BOOL_LOCAL_MODE from pyspark_xray's const.py auto-detects whether local mode is on or off based on the current OS, with values:
    • True: if the current OS is Mac or Windows
    • False: otherwise

With this flag in your Spark code base, you can locally debug and remotely execute your Spark application using the same code base.
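To make the sampling and local-mode ideas concrete, here is a rough sketch of the pattern (the function, file names, and flag below are hypothetical; the flag only mirrors the described behavior of CONST_BOOL_LOCAL_MODE and is not pyspark_xray's actual API):

import platform
from pyspark.sql import SparkSession

# Mirrors the described behavior: True on Mac or Windows (local debugging),
# False otherwise (e.g. on a Linux cluster node).
IS_LOCAL_MODE = platform.system() in ("Darwin", "Windows")

def load_input(spark):
    if IS_LOCAL_MODE:
        # A small local sample (e.g. 100 rows taken from production data)
        # so the job fits into a single PC/Mac's memory while stepping
        # through the code in an IDE.
        return spark.read.parquet("sample_100_rows.parquet")
    # Full production input on the cluster.
    return spark.read.parquet("hdfs:///prod/input")

if __name__ == "__main__":
    spark = SparkSession.builder.appName("xray-style-example").getOrCreate()
    df = load_input(spark)
    print(df.count())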

answered Oct 15 '22 by bradyjiang