Including a Spark Package JAR file in an SBT-generated fat JAR

The spark-daria project is uploaded to Spark Packages and I'm accessing spark-daria code in another SBT project with the sbt-spark-package plugin.

I can include spark-daria in the fat JAR file generated by sbt assembly with the following code in the build.sbt file.

spDependencies += "mrpowers/spark-daria:0.3.0"   // pull spark-daria from Spark Packages via sbt-spark-package

// keep only spark-daria in the assembly; exclude every other JAR on the classpath
val requiredJars = List("spark-daria-0.3.0.jar")
assemblyExcludedJars in assembly := {
  val cp = (fullClasspath in assembly).value
  cp filter { f =>
    !requiredJars.contains(f.data.getName)
  }
}

This code feels like a hack. Is there a better way to include spark-daria in the fat JAR file?

N.B. I want to build a semi-fat JAR file here. I want spark-daria to be included in the JAR file, but I don't want all of Spark in the JAR file!

asked May 17 '17 by Powers

People also ask

How do I run a JAR file in sbt?

A JAR file created by sbt package can be run by the Scala interpreter, but not by the Java interpreter. This is because the class files in that JAR depend on the Scala library classes (scala-library), which sbt package does not include in the JAR it generates.
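For illustration, here is a minimal sketch of what this looks like in practice (the project name, main class, and paths below are hypothetical, not taken from the question):

// build.sbt (minimal sketch)
name := "my-app"
scalaVersion := "2.11.8"
mainClass in (Compile, packageBin) := Some("com.example.Main")

// After `sbt package`:
//   java -jar target/scala-2.11/my-app_2.11-0.1.0-SNAPSHOT.jar
//     fails with NoClassDefFoundError for scala/* classes, because scala-library is not in the JAR
//   scala target/scala-2.11/my-app_2.11-0.1.0-SNAPSHOT.jar
//     works, because the scala runner puts scala-library on the classpath for you
//   java -cp /path/to/scala-library.jar:target/scala-2.11/my-app_2.11-0.1.0-SNAPSHOT.jar com.example.Main
//     also works once scala-library is added to the classpath by hand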

What does sbt package do?

By default, sbt constructs a manifest for the binary package from settings such as organization and mainClass. Additional attributes may be added to the packageOptions setting, scoped by the configuration and package task. Main attributes may be added with Package.ManifestAttributes.
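As a sketch, an extra manifest attribute can be added like this (the attribute name and value are purely illustrative):

// build.sbt: add a custom attribute to the manifest written by sbt package
packageOptions in (Compile, packageBin) +=
  Package.ManifestAttributes("Built-By" -> "mrpowers")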


1 Answer

The sbt-spark-package README for version 0.2.6 states the following:

In any case where you really can't specify Spark dependencies using sparkComponents (e.g. you have exclusion rules) and configure them as provided (e.g. standalone jar for a demo), you may use spIgnoreProvided := true to properly use the assembly plugin.

You should then use this flag in your build definition and mark your Spark dependencies as provided, as I do with spark-sql 2.2.0 in the following example:

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"

Please note that by setting this, your IDE may no longer have the dependency references it needs to compile and run your code locally, so you may have to add the necessary JARs to the classpath by hand. I do this often in IntelliJ: I keep a Spark distribution on my machine and add its jars directory to the IntelliJ project definition (this question may help you with that, should you need it).
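Putting the whole answer together, a build.sbt for the setup in the question might look roughly like the sketch below. The spark-daria and Spark versions come from the question and the answer; the package name is hypothetical, and the setting names (spName, sparkVersion, spIgnoreProvided) are the ones documented by sbt-spark-package:

// build.sbt (sketch)
spName := "mrpowers/my-app"          // Spark Packages name, in organization/name form
scalaVersion := "2.11.8"
sparkVersion := "2.2.0"              // used by the sbt-spark-package plugin
spIgnoreProvided := true             // per the README quote above: needed so the assembly plugin works with hand-declared provided deps

// spark-daria comes from Spark Packages and ends up in the fat JAR
spDependencies += "mrpowers/spark-daria:0.3.0"

// Spark itself is marked provided, so it stays out of the fat JAR
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0" % "provided"

With this in place, sbt assembly should produce the semi-fat JAR the question asks for: spark-daria is included, Spark is not, and the assemblyExcludedJars filter is no longer needed.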

answered by stefanobaghino