
Drop into a Scala interpreter in Spark script?

I'm using Scala 2.11.8 and Spark 2.1.0. I'm totally new to Scala.

Is there a simple way to add a single line breakpoint, similar to Python:

import pdb; pdb.set_trace()

where I'll be dropped into a Scala shell and I can inspect what's going on at that line of execution in the script? (I'd settle for just the end of the script, too...)

I'm currently starting my scripts like so:

$SPARK_HOME/bin/spark-submit --class "MyClassName" --master local target/scala-2.11/my-class-name_2.11-1.0.jar

Is there a way to do this? It would help with debugging immensely.

EDIT: The solutions in this other SO post were not very helpful: they required lots of boilerplate and didn't work.

asked Jan 14 '17 by lollercoaster

1 Answer

I would recommend one of the following two options:

Remote debugging & IntelliJ IDEA's "evaluate expression"

The basic idea is that you debug your app just as you would an ordinary piece of code run from within your IDE. The Run -> Evaluate Expression function lets you prototype code, and you can use most of the debugger's usual functionality: variable display, stepping (over/into), and so on. However, since you're not running the application from within your IDE, you need to:

  1. Set up the IDE for remote debugging, and
  2. Supply the application with the correct Java options for remote debugging.

For 1, go to Run -> Edit configurations, click the + button in the top right-hand corner, select Remote, and copy the contents of the text field under Command line arguments for running remote JVM (official help).

For 2, you can use the SPARK_SUBMIT_OPTS environment variable to pass those JVM options, e.g.:

# suspend=y makes the JVM wait until the debugger attaches before running anything
SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
  $SPARK_HOME/bin/spark-submit --class Main --master "spark://127.0.0.1:7077" \
  ./path/to/foo-assembly-1.0.0.jar

Now you can hit the debug button and set breakpoints, etc.
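For concreteness, here's a minimal sketch of the kind of driver-side entry point this attaches to (the object name and input path are made up); a breakpoint on the count() line would be hit on the driver once the debugger is attached:

import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("debug-demo").getOrCreate()
    val df = spark.read.json("/path/to/data.json")

    // This line runs on the driver, so a breakpoint here will be hit,
    // and Evaluate Expression can inspect e.g. df.schema or df.head(5)
    val n = df.count()
    println(s"row count: $n")

    spark.stop()
  }
}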

Apache Zeppelin

If you're writing more script-style Scala, you may find it helpful to write it in a Zeppelin Spark Scala interpreter. While it's more like Jupyter/IPython notebooks or the IPython shell than (i)pdb, it does let you inspect what's going on at runtime. It also lets you graph your data, etc. I'd start with these docs.
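As a rough sketch of what a Zeppelin paragraph looks like (the input path is a placeholder; the Spark interpreter pre-binds the SparkSession as spark and the Zeppelin context as z):

// In a %spark paragraph, `spark` and `z` are already bound by the interpreter
val df = spark.read.json("/path/to/data.json")
df.printSchema()       // inspect the schema interactively
z.show(df.limit(100))  // Zeppelin renders the DataFrame as a table/chart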

Caveat

I think the above will only let you debug code running on the driver node, not on the worker nodes (which run your actual map, reduce, etc. functions). If you set a breakpoint inside an anonymous function in myDataFrame.map{ ... }, for example, it probably won't be hit, since that code is executed on some worker node. However, with e.g. myDataFrame.head and the Evaluate Expression functionality I've been able to fulfil most of my debugging needs. Having said that, I've not specifically tried passing Java options to the executors, so perhaps it's possible (but probably tedious) to get it to work.
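If you did want to try it, the natural starting point would be Spark's spark.executor.extraJavaOptions setting, along these lines (an untested sketch: with several executors per host the fixed port will clash, and suspend=y would block every executor until a debugger attaches, hence suspend=n here):

# Untested sketch: attach a JDWP agent to each executor JVM instead of the driver
$SPARK_HOME/bin/spark-submit --class Main --master "spark://127.0.0.1:7077" \
  --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5006" \
  ./path/to/foo-assembly-1.0.0.jar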

answered Nov 01 '22 by m01