 

How to debug Spark application locally?

Tags:

apache-spark

I would like to learn Spark step by step and wonder how to debug a Spark application locally? Could anyone please detail the steps needed to do this?

I can run the simpleApp on the spark website locally from the command line but I just need to step through the code and see how it works.

asked May 22 '15 by ekardes


People also ask

How do I launch the spark shell in debug mode?

properties" path to log messages and pass it to your spark shell command. Then run the spark-shell as following then you should see DEBUG messages. You can pass your own "log4j. properties" path to log messages and pass it to your spark shell command.

How do you debug a spark in Python?

First, choose Edit Configurations… from the Run menu; this opens the Run/Debug Configurations dialog. Click the + on the toolbar and, from the list of available configurations, select Python Debug Server.
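The debug-server side of that configuration pairs with a settrace call in your PySpark code. A rough sketch, assuming the server listens on port 5678 (the host and port are whatever you entered in the PyCharm dialog):

# requires the pydevd-pycharm package matching your PyCharm version
import pydevd_pycharm
# pause here and connect back to the waiting Python Debug Server
pydevd_pycharm.settrace('localhost', port=5678,
                        stdoutToServer=True, stderrToServer=True)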

How do you debug a spark cluster?

  1. Start the job.
  2. Open the Spark UI and find out where your process is running.
  3. Use the ssh command to forward the port specified in the agent from the target node to your local machine through the edge node (see the sketch below).
  4. Start the remote debug from your IDE, using localhost as the IP and the forwarded port as the port.
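A sketch of the forwarding step; the host names here are placeholders for your cluster:

# listen on local port 5005 and forward it, via the edge node,
# to port 5005 on the worker node where the process is running
ssh -L 5005:worker-node:5005 user@edge-node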


2 Answers

Here's how I do it using IntelliJ.

First, make sure you can run your spark application locally using spark-submit, e.g. something like:

spark-submit --class MyMainClass myapplication.jar 

Then, tell your local spark driver to pause and wait for a connection from a debugger when it starts up, by adding an option like the following:

--conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 

where -agentlib:jdwp loads the Java Debug Wire Protocol (JDWP) agent, followed by a comma-separated list of sub-options:

  • transport defines the connection protocol between debugger and debuggee. The choices are socket (dt_socket) and shared memory (dt_shmem); you almost always want dt_socket, except, I believe, in some cases on Microsoft Windows.
  • server says whether this process acts as the server (or, conversely, the client) when talking to the debugger. You always need one server and one client; here we act as the server and wait for a connection from the debugger.
  • suspend says whether to pause execution until a debugger has successfully connected. We turn this on so the driver won't start until the debugger connects.
  • address is the port to listen on for incoming debugger connection requests. You can set it to any available port; just make sure the debugger is configured to connect to the same port.

So now, your spark-submit command line should look something like:

spark-submit \
  --name MyApp \
  --class MyMainClass \
  --conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 \
  myapplication.jar

Now if you run the above, you should see something like

Listening for transport dt_socket at address: 5005 

and your spark application is waiting for the debugger to attach.

Next, open the IntelliJ project containing your Spark application, and then open "Run -> Edit Configurations..." Then click the "+" to add a new run/debug configuration, and select "Remote". Give it a name, e.g. "SparkLocal", and select "Socket" for Transport, "Attach" for Debugger mode, and type in "localhost" for Host and the port you used above for Port, in this case, "5005". Click "OK" to save.

In my version of IntelliJ, it suggests a debug command line for the debugged process, and that suggestion uses "suspend=n". We ignore that and use "suspend=y" (as above) because we want the application to wait until we connect before starting.

Now you should be ready to debug. Simply start spark with the above command, then select the IntelliJ run configuration you just created and click Debug. IntelliJ should connect to your Spark application, which should now start running. You can set break points, inspect variables, etc.

Spark Shell

With spark-shell simply export SPARK_SUBMIT_OPTS as follows:

export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 

Attach to localhost:5005 using your debugger (e.g. IntelliJ IDEA) and with the Spark sources imported, you should be able to step through the code just fine.

answered Oct 18 '22 by Jason Evans


Fire up the Spark shell. This is straight from the Spark documentation:

./bin/spark-shell --master local[2] 

You will also see the Spark shell referred to as the REPL. It is by far the best way to learn Spark. I spend 80% of my time in the Spark shell and the other 20% translating the code into my application.
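For example, a throwaway session might look like the following (illustrative only; sc is the SparkContext the shell creates for you):

scala> val rdd = sc.parallelize(1 to 100)   // build a small RDD to play with
scala> rdd.filter(_ % 2 == 0).count()       // try a transformation and inspect the result
res0: Long = 50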

answered Oct 18 '22 by David Griffin