What is the precedence in class loading when both the uber jar of my Spark application and the contents of the --jars option of my spark-submit shell command contain similar dependencies?
I ask this from a third-party library integration standpoint. If I set --jars to use a third-party library at version 2.0 and the uber jar passed to this spark-submit script was assembled using version 2.1, which class is loaded at runtime?
At present, I am thinking of keeping my dependencies on HDFS and adding them to the --jars option of spark-submit, while asking users (via end-user documentation) to set the scope of this third-party library to 'provided' in their Spark application's Maven pom file.
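For illustration, the submit command I have in mind would look roughly like this (the HDFS path, class name, and jar names below are made up):

spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --jars hdfs:///libs/thirdparty-lib-2.0.jar \
  my-uber-app.jar

with the uber jar (my-uber-app.jar) built with the third-party library marked as 'provided' so it is not bundled at all.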
You can also add jars using the spark-submit option --jars; with this option you can add a single JAR or multiple JARs, comma-separated.
One is set through spark-submit and the other via code; choose whichever suits you better. One important thing to note is that using either of these options does not by itself add the JAR file to your driver/executor classpath; you'll need to explicitly add them using the extraClassPath configuration on both.
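As a minimal sketch, assuming a hypothetical jar at /path/to/file1.jar, the extraClassPath settings can be passed alongside --jars like this:

spark-submit \
  --jars /path/to/file1.jar \
  --conf spark.driver.extraClassPath=/path/to/file1.jar \
  --conf spark.executor.extraClassPath=file1.jar \
  ...

(On YARN, jars distributed via --jars are copied into each executor's working directory, which is why the executor entry here is just the file name; adjust the paths for your deployment.)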
Use the --jars option. To add JARs to a Spark job, the --jars option can be used to include JARs on the Spark driver and executor classpaths. If multiple JAR files need to be included, separate them with commas. The following is an example: spark-submit --jars /path/to/jar/file1,/path/to/jar/file2 ...
Spark JAR files let you package a project into a single file so it can be run on a Spark cluster. A lot of developers write Spark code in browser-based notebooks because they're unfamiliar with JAR files.
This is somewhat controlled with the params spark.driver.userClassPathFirst and spark.executor.userClassPathFirst. If these are set to true (the default is false), then, from the docs:
(Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only.
I wrote some of the code that controls this, and there were a few bugs in the early releases, but if you're using a recent Spark release it should work (although it is still an experimental feature).
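If you want to turn this on from spark-submit, a minimal sketch (jar names are hypothetical) would be:

spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars hdfs:///libs/thirdparty-lib-2.0.jar \
  my-uber-app.jar

With both set to true, classes from user-added jars are tried before Spark's own jars; with the defaults (false), Spark's jars take precedence for any overlapping classes.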