I'd like to debug a Spark Application that is running on an AWS EMR cluster. It would be fantastic if I could connect and debug it remotely using IntelliJ. I've searched but found very little.
Is it possible and if so, could someone kindly point me in the right direction?
Thanks.
First, I would caution that what you're trying to do is basically impossible, due to the multitudinous bugs and unexpected use cases of AWS EMR. I would strongly recommend paying for the largest single instance you can to run your job (they have c4.8xlarge at the affordable end and x1.32xlarge for the real crazies!), and simply installing Spark inside that instance and running your job there.
If you still want to debug remotely, the trick is to have IntelliJ listen for a connection from the cluster rather than attach to it. First, verify that the cluster can reach your debugging machine: run nc -l 5005 on your machine, then SSH into your master and try echo "test" | nc your_ip_address 5005. Until you see test on your machine's terminal, do not proceed.

In IntelliJ, create a new Remote run configuration and change Debugger Mode to Listen. Name the configuration and save it. When you hit Debug, it will be waiting for a connection. In that window, you will see "Command line arguments for running remote JVM", reading something like:
-agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5005,suspend=y
You can remove the onthrow and oncaught lines like I did. Suppose your debugging machine is accessible over the Internet at 24.13.242.141. Pretend it actually read:
-agentlib:jdwp=transport=dt_socket,server=n,address=24.13.242.141:5005,suspend=y
We'll use this to set up debugging on the Spark processes. There are two processes that can be debugged: the driver process (executing the code where your SparkContext is instantiated) and the executor process. Ultimately, you will pass these JVM options via a special argument to spark-submit to get the connection to occur. For debugging the driver, use:
spark-submit --driver-java-options -agentlib:jdwp=transport=dt_socket,server=n,address=24.13.242.141:5005,suspend=y --class ...
For debugging executor processes, you would use a configuration option:
spark-submit --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=n,address=24.13.242.141:5005,suspend=y" --class ...
Debugging the executor is extra tricky, since there will be multiple processes, and you can't really debug multiple processes the way you might imagine in IntelliJ. Also, you can't really limit the number of executors to 1 in AWS EMR, even when they claim you can. I believe it is okay if the other executors fail (they will when they can't connect to your debugging session), but this step is untested.
You can modify arguments to spark-submit with both the SDK and the Web Console. Note that in the SDK you should not attempt to concatenate the "args" yourself; pass them as array items, as it asks you to.
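For example, here is a rough sketch of adding such a step through the AWS Java SDK for EMR, with every spark-submit argument passed as its own array item (the class name, bucket, and JAR key are placeholders I made up; the debug address is the example one from above):

import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

// Sketch of a step definition: command-runner.jar invokes spark-submit on the cluster,
// and each argument is a separate array item rather than one concatenated string.
StepConfig debugStep = new StepConfig()
    .withName("spark-debug-step")
    .withActionOnFailure("TERMINATE_CLUSTER")
    .withHadoopJarStep(new HadoopJarStepConfig()
        .withJar("command-runner.jar")
        .withArgs("spark-submit",
                  "--deploy-mode", "cluster",
                  "--master", "yarn",
                  "--driver-java-options",
                  "-agentlib:jdwp=transport=dt_socket,server=n,address=24.13.242.141:5005,suspend=y",
                  "--class", "com.example.MySparkJob",
                  "s3://my-bucket/jars/my-job-all.jar"));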
You will need to modify the master's security group from the inception of the cluster in order to debug the driver (likewise with the slave's security group to debug the executor). Create a security group that allows outbound connections to your IP address and the port of the debugger (i.e., TCP Outbound to 24.13.242.141:5005). Create a security group with that one entry and add it to the master/slave job flow instance config's security groups with the AWS SDK (.withAdditionalMasterSecurityGroups(...)). I'm not sure how to do this from the Web Console.
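As a rough sketch (the instance types and the security group ID are placeholders), attaching that extra group when you define the cluster's instances might look like this:

import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;

// Attach the extra security group (created beforehand to allow TCP outbound to 24.13.242.141:5005)
// to the master for driver debugging, and to the slaves for executor debugging.
JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
    .withInstanceCount(3)
    .withMasterInstanceType("m4.large")
    .withSlaveInstanceType("m4.large")
    .withAdditionalMasterSecurityGroups("sg-0123456789abcdef0")   // placeholder group ID
    .withAdditionalSlaveSecurityGroups("sg-0123456789abcdef0");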
Your job needs to be packaged as a single shaded (uber) JAR. With Gradle, you can do this with the Shadow plugin (classpath "com.github.jengelman.gradle.plugins:shadow:1.2.4"). Also, enable Zip64. You will upload the result of the :shadowJar task to S3 to actually execute on AWS EMR.

buildscript {
    repositories {
        mavenCentral()
        maven {
            url "https://plugins.gradle.org/m2/"
        }
    }
    dependencies {
        classpath "com.github.jengelman.gradle.plugins:shadow:1.2.4"
    }
}

apply plugin: "com.github.johnrengelman.shadow"

shadowJar {
    zip64 true
}
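As a quick sketch of the upload step (bucket, key, and local path are placeholders; the aws s3 cp CLI works just as well), using the AWS Java SDK it could look like:

import java.io.File;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Upload the output of the :shadowJar task so EMR can fetch it from S3.
AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
s3.putObject("my-bucket",                              // placeholder bucket
             "jars/my-job-all.jar",                    // placeholder key
             new File("build/libs/my-job-all.jar"));   // placeholder local path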
A few more tips for running Spark on AWS EMR:

Submit with --deploy-mode cluster and --master yarn (basically undocumented).

You may see advice to set S3 filesystem properties on sc.hadoopConfiguration() (e.g., configuration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");). Do not configure these properties at all! hadoop-aws works correctly by default in the EMR environment and has the appropriate properties set automatically.

Set the cluster's log4j logging options to report only WARN and higher. In this SDK, you would do this with:

.withConfigurations(new Configuration()
    .withClassification("spark-log4j")
    .addPropertiesEntry("log4j.rootCategory", "WARN, console"))
Check the containers/applications_.../container.../stderr.gz log for errors before you bother debugging!

Set the maximizeResourceAllocation configuration property for the spark classification:

.withConfigurations(new Configuration()
    .withClassification("spark")
    .addPropertiesEntry("maximizeResourceAllocation", "true"))
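To show how those two fragments fit together, here is a rough sketch of a cluster request carrying both classifications (the name is a placeholder, and the other RunJobFlowRequest fields are omitted):

import com.amazonaws.services.elasticmapreduce.model.Configuration;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;

RunJobFlowRequest request = new RunJobFlowRequest()
    .withName("spark-debug-cluster")                 // placeholder name
    .withConfigurations(
        new Configuration()                          // quiet the logs
            .withClassification("spark-log4j")
            .addPropertiesEntry("log4j.rootCategory", "WARN, console"),
        new Configuration()                          // let Spark use the whole cluster
            .withClassification("spark")
            .addPropertiesEntry("maximizeResourceAllocation", "true"));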
Remember to close your SparkContext when your job is done (sc.close()). Otherwise, Yarn will never start. Hilariously undocumented.

Be careful loading resources from inside the JAR with ClassLoader.getSystemClassLoader(). If class A, ordinarily in a.jar, wants to access a resource in b.jar, and class B is a class in b.jar, use B.class.getClassLoader().getResource.... Also, use relative paths (omit the forward slash at the beginning of your resource reference). I would suggest catching NullPointerExceptions and trying both, so that your JAR works regardless of how it is packaged.
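As a rough sketch of that fallback idea (the helper and its arguments are invented; "anchor" stands for a class like B that lives in the same JAR as the resource):

import java.io.InputStream;

// Try the class loader of a class known to live in the same JAR as the resource
// first, then fall back to the system class loader, so the lookup works however
// the JAR ends up being packaged. Always pass a relative path (no leading slash).
public static InputStream openResource(Class<?> anchor, String relativePath) {
    try {
        InputStream in = anchor.getClassLoader().getResourceAsStream(relativePath);
        if (in != null) {
            return in;
        }
    } catch (NullPointerException ignored) {
        // fall through and try the system class loader instead
    }
    return ClassLoader.getSystemClassLoader().getResourceAsStream(relativePath);
}

// e.g. openResource(B.class, "config/app.properties");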
When you implement Spark's Function interfaces and similar, make sure to create a no-arg constructor that performs all the initialization you may depend on. Spark uses Kryo serialization (as opposed to Java serialization) for both closures and function instances, and if you neglect to provide a no-arg constructor with your application-specific initialization code (e.g., loading from resources), you won't perform all the initialization you're expecting to.
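For example, a sketch of what that might look like (the class, resource, and lookup table are invented for illustration):

import java.util.Properties;
import org.apache.spark.api.java.function.Function;

// Hypothetical mapper: all initialization happens in the no-arg constructor,
// so it still runs when Spark creates the instance on an executor.
public class CountryLookup implements Function<String, String> {
    private final Properties countries = new Properties();

    public CountryLookup() {
        try {
            // relative resource path, loaded via this class's class loader (see above)
            countries.load(CountryLookup.class.getClassLoader()
                .getResourceAsStream("data/countries.properties"));
        } catch (Exception e) {
            throw new RuntimeException("could not load countries.properties", e);
        }
    }

    @Override
    public String call(String code) {
        return countries.getProperty(code, "unknown");
    }
}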