In the official Spark documentation it is explained that the term application jar corresponds to:
A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime.
This can easily be taken care of by using the provided scope in Maven or sbt:
"org.apache.spark" % "spark-core_2.10" % sparkVersion % Provided
However, I might be missing something obvious here, but I could not find a straight answer: which specific libraries will be added at runtime? Will it be just the core ones (e.g. spark-core, hadoop-core), or will others (e.g. spark-streaming, hadoop-hdfs) be added as well?
Is there a way to check this and get the actual list of Spark dependencies that will be added at runtime and hence can be marked as Provided?
The short answer is: all Spark libraries, plus the Hadoop version you chose when you downloaded Spark from the download page.
The longer answer is: it depends on the deployment mode you're using:
Local Mode: since there's only one JVM in local mode, and that's the driver application's JVM, it depends on how you packaged your driver app. If you didn't package it (e.g. you run it directly from an IDE), then marking a dependency as "provided" doesn't mean anything, so whichever libraries you have in your sbt file will be present at runtime. If you package your driver application and try to run it with Spark marked as provided, you'll probably see failures unless you bring these jars into the mix some other way (but local mode isn't really meant for that anyway...). See the sketch below.
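To make that concrete, here is a minimal, hypothetical local-mode driver sketch (not part of the original question): everything runs in a single JVM, so the Spark classes must actually be on the compile/run classpath regardless of any provided scope.

// Minimal local-mode driver sketch (hypothetical word-count style example).
// In local mode, driver and executors share one JVM, so Spark itself has to
// be on the classpath you launch with; "provided" only matters once you
// package and submit the jar to a real cluster.
import org.apache.spark.{SparkConf, SparkContext}

object LocalModeExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("local-mode-example")
      .setMaster("local[*]") // single JVM, all cores

    val sc = new SparkContext(conf)

    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()

    counts.foreach(println)
    sc.stop()
  }
}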
Standalone Mode: if you deployed one of the pre-built packages available at the download page onto your cluster (master and worker machines), they contain all of Spark's libraries (including Spark SQL, Streaming, GraphX...) and the Hadoop version you chose. If you deployed jars that you built yourself - well, then it depends on what and how you packaged...
YARN Mode: when you submit a Spark application to a YARN cluster manager, you set the Spark jar location for the application to use (via the spark.yarn.jar parameter); whatever that jar (or those jars) contains will be loaded. Once again, if that jar is one of the pre-built ones, it contains all Spark libraries and the chosen Hadoop version.
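As for checking what is actually available at runtime, one rough but simple approach (an illustrative sketch, not an official Spark facility) is to print the driver JVM's classpath from inside a submitted application. Anything the deployment adds, such as the Spark and Hadoop jars, will show up there even though it was marked provided at build time; on YARN some entries may be container-local paths or wildcards.

// Print every entry on the driver JVM's classpath, one per line.
// Submit this with spark-submit in the deployment mode you care about and
// compare the output with the dependencies declared in your build file.
object PrintRuntimeClasspath {
  def main(args: Array[String]): Unit = {
    System.getProperty("java.class.path")
      .split(java.io.File.pathSeparator)
      .sorted
      .foreach(println)
  }
}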