Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

sbt assembly task runs slowly after adding some dependencies

I'm kind of new to deployment in scala and I configured the sbt-assembly plugin, all worked well.

Some days ago I added hadoop, spark and some other dependencies, then the assembly task become extremely slow (8 to 10 minutes) and before that, it was < 30s. Most of the time is used for generating the assembly-jar (it takes several seconds for the jar to grow 1MB in size).

I observed that there's a lot of merge conflicts, which are resolved by first strategy. Does this affects the speed of assembly?

I've played with the -Xmx option for sbt (add -Xmx4096m) but it doesn't help.

I'm using sbt 12.4 and sbt-assembly. Any suggestions or pointers for optimize this task?

like image 234
darkjh Avatar asked Oct 23 '13 13:10

darkjh


1 Answers

So 0__'s comment is right on:

Have you read the Readme. It specifically suggests that you might change the cacheUnzip and cacheOutput settings. I would give it a try.

cacheUnzip is an optimization feature, but cacheOutput isn't. The purpose of cacheOutput is so that you get the identical jar when your source has not changed. For some people, it's important to that output jars don't change unnecessarily. The caveat is that it's checking the SHA-1 hash of all *.class files. So the readme says:

If there are a large number of class files, this could take a long time

From what I can tell, unzipping and application of merge strategy together takes around a minute or two, but the checking of the SHA-1 seems to take forever. Here's assembly.sbt that turns off the output cache:

import AssemblyKeys._ // put this at the top of the file

assemblySettings

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>  {
    case PathList("javax", "servlet", xs @ _*)         => MergeStrategy.first
    case PathList("org", "apache", "commons", xs @ _*) => MergeStrategy.first // commons-beanutils-core-1.8.0.jar vs commons-beanutils-1.7.0.jar
    case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first // kryo-2.21.jar vs minlog-1.2.jar
    case "about.html"                                  => MergeStrategy.rename
    case x => old(x)
  }
}

assemblyCacheOutput in assembly := false

The assembly finished in 58 seconds after cleaning, and the second run without cleaning took 15 seconds. Although some of the runs took 200+ secs too.

Looking at the source, I probably could optimize cacheOutput, but for now, turning it off should make assembly much faster.

Edit:

I've added #96 Performance degradation when adding library dependencies based on this question, and added some fixes in sbt-assembly 0.10.1 for sbt 0.13.

sbt-assembly 0.10.1 avoids content hashing of the unzipped items of the dependent library jars. It also skips jar caching done by sbt, since sbt-assembly is already caching the output.

The changes make assembly task run more consistently. Using deps-heavy spark as sample project, assembly task was run 15 times after a small source change. sbt-assembly 0.10.0 took 19+/-157 seconds (mostly within 20 secs, but going 150+ secs 26% of the runs). On the other hand, sbt-assembly 0.10.1 took 16+/-1 seconds.

like image 193
Eugene Yokota Avatar answered Oct 23 '22 23:10

Eugene Yokota