Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

sbt assembly shading to create fat jar to run on spark

I'm using sbt assembly to create a fat jar which can run on spark. Have dependencies on grpc-netty. Guava version on spark is older than the one required by grpc-netty and I run into this error: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument. I was able to resolve this by setting userClassPathFirst to true on spark, but leads to other errors.

Correct me if I am wrong, but from what I understand, I shouldn't have to set userClassPathFirst to true if I do shading correctly. Here's how I do shading now:

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.guava.**" -> "my_conf.@1")
    .inLibrary("com.google.guava" % "guava" % "20.0")
    .inLibrary("io.grpc" % "grpc-netty" % "1.1.2")
)

libraryDependencies ++= Seq(
  "org.scalaj" %% "scalaj-http" % "2.3.0",
  "org.json4s" %% "json4s-native" % "3.2.11",
  "org.json4s" %% "json4s-jackson" % "3.2.11",
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" % "spark-sql_2.11" % "2.2.0" % "provided",
  "org.clapper" %% "argot" % "1.0.3",
  "com.typesafe" % "config" % "1.3.1",
  "com.databricks" %% "spark-csv" % "1.5.0",
  "org.apache.spark" % "spark-mllib_2.11" % "2.2.0" % "provided",
  "io.grpc" % "grpc-netty" % "1.1.2",
  "com.google.guava" % "guava" % "20.0"
)

What am I doing wrong here and how do I fix it?

like image 757
Kumar Bharath Prabhu Avatar asked Aug 31 '17 19:08

Kumar Bharath Prabhu


People also ask

How do I run a JAR file in sbt?

A JAR file created by SBT can be run by the Scala interpreter, but not the Java interpreter. This is because class files in the JAR file created by sbt package have dependencies on Scala class files (Scala libraries), which aren't included in the JAR file SBT generates.

What is a shaded dependency?

Shading is a process where a dependency is relocated to a different Java package and copied into the same JAR file as the code that relies on that dependency. The main purpose of shading is to avoid conflicts between the versions of dependencies used by a library and the versions used by the consumers of that library.

What is Assembly sbt?

The sbt-assembly plugin is an SBT plugin for building a single independent fat JAR file with all dependencies included. This is inspired by the popular Maven assembly plugin, which is used to build fat JARs in Maven.

What is a spark JAR file?

Spark JAR files let you package a project into a single file so it can be run on a Spark cluster. A lot of developers develop Spark code in brower based notebooks because they're unfamiliar with JAR files.


1 Answers

You are almost there. What shadeRule does is it renames class names, not library names:

The main ShadeRule.rename rule is used to rename classes. All references to the renamed classes will also be updated.

In fact, in com.google.guava:guava there are no classes with package com.google.guava:

$ jar tf ~/Downloads/guava-20.0.jar  | sed -e 's:/[^/]*$::' | sort | uniq
META-INF
META-INF/maven
META-INF/maven/com.google.guava
META-INF/maven/com.google.guava/guava
com
com/google
com/google/common
com/google/common/annotations
com/google/common/base
com/google/common/base/internal
com/google/common/cache
com/google/common/collect
com/google/common/escape
com/google/common/eventbus
com/google/common/graph
com/google/common/hash
com/google/common/html
com/google/common/io
com/google/common/math
com/google/common/net
com/google/common/primitives
com/google/common/reflect
com/google/common/util
com/google/common/util/concurrent
com/google/common/xml
com/google/thirdparty
com/google/thirdparty/publicsuffix

It should be enough to change your shading rule to this:

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "my_conf.@1")
    .inLibrary("com.google.guava" % "guava" % "20.0")
    .inLibrary("io.grpc" % "grpc-netty" % "1.1.2")
)

So you don't need to change userClassPathFirst.

Moreover, you can simplify your shading rule like this:

assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "my_conf.@1").inAll
)

Since org.apache.spark dependencies are provided, they will not be included in your jar and will not be shaded (hence spark will use its own unshaded version of guava that it has on the cluster).

like image 60
Nikolay Vasiliev Avatar answered Nov 09 '22 02:11

Nikolay Vasiliev