
Resolving dependency problems in Apache Spark

The common problems when building and deploying Spark applications are:

  • java.lang.ClassNotFoundException
  • object x is not a member of package y compilation errors
  • java.lang.NoSuchMethodError

How can these be resolved?

asked Dec 29 '16 by user7337271

1 Answer

Apache Spark's classpath is built dynamically (to accommodate per-application user code), which makes it vulnerable to such issues. @user7337271's answer is correct, but there are some more concerns, depending on the cluster manager ("master") you're using.

First, a Spark application consists of these components (each one is a separate JVM, and therefore potentially contains different classes in its classpath):

  1. Driver: that's your application creating a SparkSession (or SparkContext) and connecting to a cluster manager to perform the actual work (a minimal driver sketch follows this list)
  2. Cluster Manager: serves as an "entry point" to the cluster, in charge of allocating executors for each application. There are several different types supported in Spark: standalone, YARN and Mesos, which we'll describe below.
  3. Executors: these are the processes on the cluster nodes, performing the actual work (running Spark tasks)
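
For concreteness, here's a minimal driver sketch (the object name, application name and the toy computation are made up for illustration): the SparkSession is created in the driver JVM, while the transformation it submits runs as tasks on the executors.

    import org.apache.spark.sql.SparkSession

    object MyDriver {
      def main(args: Array[String]): Unit = {
        // The driver JVM starts here; the SparkSession connects to whatever
        // cluster manager was requested via --master / spark.master.
        val spark = SparkSession.builder()
          .appName("my-spark-app")   // hypothetical application name
          .getOrCreate()

        // This transformation is shipped to the executors and runs there as tasks;
        // only the final count comes back to the driver.
        val evens = spark.range(0, 1000000).filter("id % 2 = 0").count()
        println(s"even numbers: $evens")

        spark.stop()
      }
    }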

The relationship between these is described in this diagram from Apache Spark's cluster mode overview:

[Diagram: Cluster Mode Overview]

Now - which classes should reside in each of these components?

This can be answered by the following diagram:

[Diagram: Class placement overview]

Let's parse that slowly:

  1. Spark Code: these are Spark's libraries. They should exist in ALL three components, as they include the glue that lets Spark perform the communication between them. By the way - Spark's authors made a design decision to include code for ALL components in ALL components (e.g. to include code that should only run in the Executor in the driver too) to simplify this - so Spark's "fat jar" (in versions up to 1.6) or "archive" (in 2.0, details below) contains the necessary code for all components and should be available in all of them.

  2. Driver-Only Code: this is user code that does not include anything that should be used on Executors, i.e. code that isn't used in any transformations on the RDD / DataFrame / Dataset. This does not necessarily have to be separated from the distributed user code, but it can be.

  3. Distributed Code: this is user code that is compiled with the driver code, but also has to be executed on the Executors - everything the actual transformations use must be included in this jar (or jars); see the sketch right after this list.
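
To make the driver-only vs. distributed split concrete, here's a hypothetical sketch (all names are made up): whatever a transformation uses - here Transformations.normalize - must be loadable on the executors, while Report.print only ever runs in the driver JVM.

    import org.apache.spark.sql.SparkSession

    // Distributed code: loaded inside executor tasks, so the jar containing it
    // must be shipped to the executors (e.g. via spark.jars).
    object Transformations {
      def normalize(s: String): String = s.trim.toLowerCase
    }

    // Driver-only code: runs exclusively in the driver JVM.
    object Report {
      def print(lines: Seq[String]): Unit = lines.foreach(println)
    }

    object App {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("class-placement-demo").getOrCreate()
        import spark.implicits._

        val data = Seq(" Foo", "BAR ").toDS()
        // normalize runs inside tasks on the executors -> distributed code
        val cleaned = data.map(s => Transformations.normalize(s)).collect()
        // Report.print runs only here, in the driver -> driver-only code
        Report.print(cleaned)

        spark.stop()
      }
    }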

Now that we've got that straight, how do we get the classes to load correctly in each component, and what rules should they follow?

  1. Spark Code: as previous answers state, you must use the same Scala and Spark versions in all components.

    1.1 In Standalone mode, there's a "pre-existing" Spark installation to which applications (drivers) can connect. That means that all drivers must use that same Spark version running on the master and executors.

    1.2 In YARN / Mesos, each application can use a different Spark version, but all components of the same application must use the same one. That means that if you used version X to compile and package your driver application, you should provide the same version when starting the SparkSession (e.g. via the spark.yarn.archive or spark.yarn.jars parameters when using YARN). The jars / archive you provide should include all Spark dependencies (including transitive dependencies), and they will be shipped by the cluster manager to each executor when the application starts.

  2. Driver Code: that's entirely up to you - driver code can be shipped as a bunch of jars or as a "fat jar", as long as it includes all Spark dependencies + all user code

  3. Distributed Code: in addition to being present on the Driver, this code must be shipped to executors (again, along with all of its transitive dependencies). This is done using the spark.jars parameter (see the configuration sketch after this list).
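
As a hedged illustration of the last two points, here's one way those settings might be wired up in code (the paths are placeholders, and in practice they are usually passed via spark-submit --conf or spark-defaults.conf rather than hard-coded in the driver):

    import org.apache.spark.sql.SparkSession

    object SubmitConfigSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("deps-demo")   // hypothetical name
          // Distributed user code (and its transitive dependencies),
          // shipped to every executor:
          .config("spark.jars", "/path/to/distributed-code-fat.jar")
          // Spark's own jars, packaged once and distributed to the cluster on YARN:
          .config("spark.yarn.archive", "hdfs:///apps/spark/spark-libs.zip")
          .getOrCreate()

        // ... application logic ...
        spark.stop()
      }
    }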

To summarize, here's a suggested approach to building and deploying a Spark Application (in this case - using YARN):

  • Create a library with your distributed code, package it both as a "regular" jar (with a .pom file describing its dependencies) and as a "fat jar" (with all of its transitive dependencies included); a build sketch follows this list
  • Create a driver application, with compile dependencies on your distributed code library and on Apache Spark (with a specific version)
  • Package the driver application into a fat jar to be deployed to the driver
  • Pass the right version of your distributed code as the value of the spark.jars parameter when starting the SparkSession
  • Pass the location of an archive file (e.g. gzip) containing all the jars under the lib/ folder of the downloaded Spark binaries as the value of spark.yarn.archive
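
For example, a multi-project sbt build along these lines could produce the two artifacts above (module names, versions and the use of the sbt-assembly plugin for the fat jars are assumptions of this sketch, not something prescribed by Spark):

    // build.sbt - a hypothetical sketch of the layout described above.
    // The fat jars would be produced by the sbt-assembly plugin's "assembly" task.

    val sparkVersion = "2.0.2"    // must match the Spark version on the cluster
    val commonScala  = "2.11.8"   // and the Scala version Spark was built with

    // The distributed-code library: published as a regular jar and assembled
    // into a fat jar that is later passed via spark.jars.
    lazy val distributedCode = (project in file("distributed-code"))
      .settings(
        name := "distributed-code",
        scalaVersion := commonScala,
        // "provided" assumes Spark's own jars reach the executors via
        // spark.yarn.archive / spark.yarn.jars, as described above.
        libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
      )

    // The driver application: depends on the library and on Spark itself,
    // and is assembled into a fat jar that is deployed to the driver.
    lazy val driver = (project in file("driver"))
      .dependsOn(distributedCode)
      .settings(
        name := "driver",
        scalaVersion := commonScala,
        libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
      )

Running distributedCode/assembly and driver/assembly (again, assuming sbt-assembly) would then yield the fat jar referenced by spark.jars and the driver fat jar, respectively.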

answered by Tzach Zohar