
AWS EMR - IntelliJ Remote Debugging Spark Application

I'd like to debug a Spark Application that is running on an AWS EMR cluster. It would be fantastic if I could connect and debug it remotely using IntelliJ. I've searched but found very little.

Is it possible and if so, could someone kindly point me in the right direction?

Thanks.


1 Answer

First, I would caution that what you're trying to do is basically impossible, due to the multitudinous bugs and unexpected use cases of AWS EMR. I would strongly recommend paying for the largest single instance you can to run your job (they have c4.8xlarge at the affordable end and x1.32xlarge for the real crazies!), and simply installing Spark inside that instance and running your job there.

Prerequisites

  • Your VPC must be correctly configured to allow any connectivity to the outside world at all. This means your Internet Gateway works correctly. You can test this by launching a cluster with an EC2 key pair, modifying the security group of the master to allow SSH connections from your machine (they naturally don't do this by default) and trying to connect to the master from your machine. If you can't do this, you will not be able to debug. I couldn't even meet this prerequisite on a fresh cluster with no additional configuration!
  • The machine that is running IntelliJ for debugging must be accessible from the Internet. To test this, modify the master instance's security group to allow outbound connections to your machine on port 5005. Then, run nc -l 5005 on your machine. SSH into your master and try echo "test" | nc your_ip_address 5005. Until you see test on your machine's terminal, do not proceed.
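
If you would rather run the listening end of that second check from the JVM instead of nc, a minimal sketch in plain Java (no dependencies; port 5005 matches the debugger port used throughout this answer) does the same job as nc -l 5005:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal stand-in for `nc -l 5005`: listen on the debugger port and print
// whatever the master sends, so you can confirm the master can reach you.
public class ListenCheck {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(5005);
             Socket client = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(client.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // expect to see "test" here
            }
        }
    }
}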

IntelliJ settings

Create a new Remote configuration and change Debugger Mode to Listen. Name the configuration and save it. When you hit Debug, IntelliJ will wait for an incoming connection. In that configuration window, you will see "Command line arguments for running remote JVM", reading something like:

-agentlib:jdwp=transport=dt_socket,server=n,address=localhost:5005,suspend=y

You can remove the onthrow and onuncaught options, as I did. Suppose your debugging machine is accessible over the Internet at 24.13.242.141. Pretend the line actually read:

-agentlib:jdwp=transport=dt_socket,server=n,address=24.13.242.141:5005,suspend=y

We'll use this to set up debugging on the Spark processes.

Spark settings

There are two kinds of processes that can be debugged: the driver process (which executes the code where your SparkContext is instantiated) and the executor processes. Ultimately, you will pass these JVM options through a special argument to spark-submit so the connection can occur. For debugging the driver, use

spark-submit --driver-java-options -agentlib:jdwp=transport=dt_socket,server=n,address=24.13.242.141:5005,suspend=y --class ...

For debugging executor processes, you would use a configuration option:

spark-submit --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=n,address=24.13.242.141:5005,suspend=y" --class ...

Debugging the executors is extra tricky, since there will be multiple processes. You can't really debug multiple processes the way you imagine in IntelliJ. Also, you can't really limit the number of executors to 1 in AWS EMR, even when they claim you can. I believe it is okay if the other executors fail (they will, when they can't connect to your debugging session), but this step is untested.

Putting it all together

You can modify the arguments to spark-submit with both the SDK and the Web Console. Note that in the SDK you should not attempt to concatenate the "args" yourself; pass them as separate array items, as the API expects.
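
For example, with the AWS SDK for Java you would add the spark-submit invocation as a cluster step whose args are individual array items. This is only a sketch under the assumption that you submit through command-runner.jar; the bucket name and driver class are placeholders:

import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

// Sketch: every token of the spark-submit command is a separate arg.
// "com.example.MyDriver" and "my-bucket" are placeholders, not real values.
StepConfig debugStep = new StepConfig()
        .withName("spark-job-with-driver-debugging")
        .withActionOnFailure("TERMINATE_CLUSTER")
        .withHadoopJarStep(new HadoopJarStepConfig()
                .withJar("command-runner.jar")
                .withArgs("spark-submit",
                          "--deploy-mode", "cluster",
                          "--master", "yarn",
                          "--driver-java-options",
                          "-agentlib:jdwp=transport=dt_socket,server=n,address=24.13.242.141:5005,suspend=y",
                          "--class", "com.example.MyDriver",
                          "s3://my-bucket/my-app-all.jar"));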

You will need to modify the master's security group from the inception of the cluster in order to debug the driver (likewise the slave's security group to debug the executors). Create a security group with a single entry that allows outbound connections to your debugger's IP address and port (i.e., TCP outbound to 24.13.242.141:5005), and add it to the master/slave job flow instance config's security groups with the AWS SDK (.withAdditionalMasterSecurityGroups(...)). I'm not sure how to do this from the Web Console.
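
With the AWS SDK for Java, that might look roughly like the following sketch (the VPC ID, group name, and debugger address are placeholders, and the calls assume the classic v1 SDK):

import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.AuthorizeSecurityGroupEgressRequest;
import com.amazonaws.services.ec2.model.CreateSecurityGroupRequest;
import com.amazonaws.services.ec2.model.IpPermission;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;

// Sketch: create a security group that allows outbound TCP to the debugger,
// then attach it to the master instances. "vpc-12345678" is a placeholder.
AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();
String groupId = ec2.createSecurityGroup(new CreateSecurityGroupRequest()
        .withGroupName("emr-debugger-egress")
        .withDescription("Allow outbound JDWP to the debugging machine")
        .withVpcId("vpc-12345678"))
        .getGroupId();

ec2.authorizeSecurityGroupEgress(new AuthorizeSecurityGroupEgressRequest()
        .withGroupId(groupId)
        .withIpPermissions(new IpPermission()
                .withIpProtocol("tcp")
                .withFromPort(5005)
                .withToPort(5005)
                .withIpRanges("24.13.242.141/32")));

JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
        // ...instance counts, types, key pair, etc....
        .withAdditionalMasterSecurityGroups(groupId);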

Some common gotchas

  • Make sure to use Gradle to produce a shadow JAR with the Shadow plugin (buildscript classpath "com.github.jengelman.gradle.plugins:shadow:1.2.4"). Also, enable Zip64. You will upload the result of the :shadowJar task to S3 to actually execute on AWS EMR, as sketched after the build script below.
buildscript {
    repositories {
        mavenCentral()
        maven {
            url "https://plugins.gradle.org/m2/"
        }
    }
    dependencies {
        classpath "com.github.jengelman.gradle.plugins:shadow:1.2.4"
    }
}

apply plugin: "com.github.johnrengelman.shadow"

shadowJar {
    zip64 true
}
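To get the shadow JAR into S3 you can use any tool you like; a minimal sketch with the AWS SDK for Java (the bucket name, key, and local JAR path are placeholders for your own values):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import java.io.File;

// Sketch: upload the :shadowJar output so EMR can fetch it from S3.
// "my-bucket" and the paths are placeholders.
AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
s3.putObject("my-bucket", "jars/my-app-all.jar",
        new File("build/libs/my-app-all.jar"));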
  • Make sure to start your Spark application with --deploy-mode cluster and --master yarn (basically undocumented).
  • In order to access S3 from inside the driver or executors on EMR, do not do the rigamarole of modifying sc.hadoopConfiguration() (e.g., configuration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");). Do not configure these properties at all! hadoop-aws works correctly by default in the EMR environment and has the appropriate properties set automatically.
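In other words, on EMR you can simply reference s3:// paths directly; a minimal sketch (the bucket and prefixes are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch: EMRFS resolves s3:// URIs out of the box on EMR, so no fs.s3n.impl
// or credentials configuration is needed here. "my-bucket" is a placeholder.
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("s3-read"));
JavaRDD<String> lines = sc.textFile("s3://my-bucket/input/part-*");
System.out.println("line count: " + lines.count());
sc.close();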
  • Set your log4j logging options to report only WARN and higher. With the SDK, you would do this with:
.withConfigurations(new Configuration()
    .withClassification("spark-log4j")
    .addPropertiesEntry("log4j.rootCategory", "WARN, console"))
  • Check your containers/applications_.../container.../stderr.gz log for errors before you bother debugging!
  • If you see this error, "WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources", in your container logs, make sure to add the maximizeResourceAllocation configuration property for the spark classification.
new Configuration()
        .withClassification("spark")
        .addPropertiesEntry("maximizeResourceAllocation", "true")
  • Don't forget to close your context at the end of the driver program (sc.close()). Otherwise, YARN will never mark the application as finished. Hilariously undocumented.
  • Resources in shadow JARs can only be loaded by a class inside the same "JAR" as the resource. In other words, do not use ClassLoader.getSystemClassLoader(). If class A, ordinarily in a.jar, wants to access a resource in b.jar, and class B is a class in b.jar, use B.class.getClassLoader().getResource.... Also, use relative paths (omit the forward slash at the beginning of your resource reference). I would suggest catching NullPointerExceptions and trying both, so that your JAR works regardless of how it is packaged. A sketch of a helper in that spirit follows.
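This is only a sketch; the class and resource names are whatever your application actually ships:

import java.io.InputStream;

// Sketch: try the relative path via the class loader of a class that lives in
// the same JAR as the resource, then fall back to Class.getResourceAsStream
// with a leading slash, so the code works however the JAR is packaged.
class Resources {
    static InputStream open(Class<?> classInSameJar, String name) {
        InputStream in = classInSameJar.getClassLoader().getResourceAsStream(name);
        if (in == null) {
            in = classInSameJar.getResourceAsStream("/" + name);
        }
        if (in == null) {
            throw new IllegalStateException("Resource not found: " + name);
        }
        return in;
    }
}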
  • If you use classes that implement the Function interfaces and similar, make sure to create a no-arg constructor that performs all the initialization you may depend on. Spark uses Kryo serialization (as opposed to Java serialization) for closures and function instances, and if you neglect to put your application-specific initialization code (e.g., loading from resources) in a no-arg constructor, that initialization will not be performed where you expect it. A sketch follows.
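For example, a hedged sketch of what such a function class might look like (the class and resource names are placeholders):

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import org.apache.spark.api.java.function.Function;

// Sketch: all initialization lives in the no-arg constructor, so however the
// instance is (re)created on an executor, it ends up fully initialized.
// "tokenizer.properties" is a placeholder resource name.
public class TokenizeFunction implements Function<String, String[]> {
    private final Properties config = new Properties();

    public TokenizeFunction() {
        try (InputStream in = TokenizeFunction.class.getClassLoader()
                .getResourceAsStream("tokenizer.properties")) {
            if (in != null) {
                config.load(in);
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to load tokenizer.properties", e);
        }
    }

    @Override
    public String[] call(String line) {
        return line.split(config.getProperty("delimiter", "\\s+"));
    }
}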