How to work efficiently with SBT, Spark and "provided" dependencies?

I'm building an Apache Spark application in Scala and I'm using SBT to build it. Here is the thing:

  1. when I'm developing under IntelliJ IDEA, I want Spark dependencies to be included in the classpath (I'm launching a regular application with a main class)
  2. when I package the application (thanks to the sbt-assembly plugin), I do not want Spark dependencies to be included in my fat JAR
  3. when I run unit tests through sbt test, I want Spark dependencies to be included in the classpath (same as #1, but from SBT)

To match constraint #2, I'm declaring Spark dependencies as provided:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  ...
)

Then, sbt-assembly's documentation suggests adding the following line to include the dependencies for unit tests (constraint #3):

run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run))
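
(Note: the <<= operator is deprecated in newer sbt; with the sbt 1.x slash syntax, the same setting would read, roughly:)

Compile / run := Defaults.runTask(
  Compile / fullClasspath,
  Compile / run / mainClass,
  Compile / run / runner
).evaluated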

That leaves constraint #1 unfulfilled, i.e. I cannot run the application in IntelliJ IDEA since the Spark dependencies are not picked up.

With Maven, I was using a specific profile to build the uber JAR: Spark dependencies were declared as regular dependencies for the main profile (IDE and unit tests) and as provided for the fat JAR packaging. See https://github.com/aseigneurin/kafka-sandbox/blob/master/pom.xml

What is the best way to achieve this with SBT?

asked Apr 05 '16 by Alexis Seigneurin



4 Answers

Use the new 'Include dependencies with "Provided" scope' checkbox in an IntelliJ run configuration.

[Screenshot: IntelliJ run configuration with the 'Include dependencies with "Provided" scope' checkbox]

answered Sep 29 '22 by Martin Tapp


(Answering my own question with an answer I got from another channel...)

To be able to run the Spark application from IntelliJ IDEA, you simply have to create a main class in the src/test/scala directory (test, not main). IntelliJ will then use the test classpath, which includes the provided dependencies.

object Launch {
  def main(args: Array[String]): Unit = {
    Main.main(args)  // delegate to the application's real main class
  }
}

Thanks Matthieu Blanc for pointing that out.

answered Sep 29 '22 by Alexis Seigneurin


Another way to make IntelliJ work.

The main trick here is to create another subproject that depends on the main subproject and has all its provided libraries in compile scope. To do this, I add the following lines to build.sbt:

lazy val mainRunner = project.in(file("mainRunner"))
  .dependsOn(RootProject(file(".")))
  .settings(
    libraryDependencies ++= spark.map(_ % "compile")
  )
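
Here, spark is assumed to be a plain Seq of the Spark modules, declared once in build.sbt without a configuration so that each subproject can pick its own scope; a sketch (version hypothetical):

val sparkVersion = "2.1.0" // hypothetical
lazy val spark = Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)

// main project: keep Spark out of the fat JAR
libraryDependencies ++= spark.map(_ % "provided")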

Now I refresh the project in IDEA and slightly change the previous run configuration so that it uses the new mainRunner module's classpath:

[Screenshot: IntelliJ run configuration using the mainRunner module's classpath]

Works flawlessly for me.

Source: https://github.com/JetBrains/intellij-scala/wiki/%5BSBT%5D-How-to-use-provided-libraries-in-run-configurations

answered Sep 29 '22 by Atais


For running Spark jobs, the general solution for "provided" dependencies works: https://stackoverflow.com/a/21803413/1091436

You can then run the app from sbt, from IntelliJ IDEA, or anything else.

It basically boils down to this:

run in Compile := Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run)).evaluated,
runMain in Compile := Defaults.runMainTask(fullClasspath in Compile, runner in(Compile, run)).evaluated
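
In context, these two settings would sit in the project's settings in build.sbt; a minimal sketch (project layout and version hypothetical):

lazy val root = (project in file("."))
  .settings(
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8" % "provided", // hypothetical version
    // make `sbt run` / `sbt runMain` use the full Compile classpath,
    // which includes "provided" dependencies
    run in Compile := Defaults.runTask(fullClasspath in Compile, mainClass in (Compile, run), runner in (Compile, run)).evaluated,
    runMain in Compile := Defaults.runMainTask(fullClasspath in Compile, runner in (Compile, run)).evaluated
  )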
answered Sep 29 '22 by VasiliNovikov