I'm fairly new to deployment in Scala. I configured the sbt-assembly plugin, and everything worked well.
A few days ago I added Hadoop, Spark, and some other dependencies, and the assembly task became extremely slow (8 to 10 minutes); before that, it took under 30 seconds. Most of the time is spent generating the assembly jar (it takes several seconds for the jar to grow by 1 MB).
I noticed that there are a lot of merge conflicts, which are resolved with the first strategy. Does this affect the speed of assembly?
I've tried the -Xmx option for sbt (adding -Xmx4096m), but it doesn't help.
I'm using sbt 12.4 and sbt-assembly. Any suggestions or pointers for optimizing this task?
So 0__'s comment is right on:
Have you read the Readme? It specifically suggests that you might change the cacheUnzip and cacheOutput settings. I would give it a try.
cacheUnzip is an optimization feature, but cacheOutput isn't. The purpose of cacheOutput is to give you an identical jar when your source has not changed. For some people, it's important that output jars don't change unnecessarily. The caveat is that it checks the SHA-1 hash of all *.class files. So the readme says:
If there are a large number of class files, this could take a long time
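To give a feel for why this check scales with the number of class files, here is a minimal sketch of hashing every *.class file under a directory. It is not sbt-assembly's actual code; the object and method names (ClassFileHashSketch, sha1Hex, hashClassFiles) are made up for illustration, and it only uses the JDK's MessageDigest and Files APIs.

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest
import scala.collection.JavaConverters._

// Hypothetical illustration only; not taken from sbt-assembly.
object ClassFileHashSketch {
  // Hex-encoded SHA-1 of one file's full contents.
  def sha1Hex(file: Path): String = {
    val digest = MessageDigest.getInstance("SHA-1")
    digest.digest(Files.readAllBytes(file)).map("%02x".format(_)).mkString
  }

  // Hash every *.class file under root, keyed by its path relative to root.
  // Every matching file is read in full, so the cost grows with the
  // number and total size of the class files.
  def hashClassFiles(root: Path): Map[String, String] = {
    val stream = Files.walk(root)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p) && p.toString.endsWith(".class"))
        .map(p => root.relativize(p).toString -> sha1Hex(p))
        .toMap
    } finally stream.close()
  }
}
```

With dependencies the size of Hadoop and Spark unzipped on disk, a pass like this touches tens of thousands of files on every run, which is consistent with the slowdown described in the question.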
From what I can tell, unzipping and applying the merge strategy together take around a minute or two, but checking the SHA-1 hashes seems to take forever. Here's an assembly.sbt that turns off the output cache:
import AssemblyKeys._ // put this at the top of the file

assemblySettings

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
    case PathList("org", "apache", "commons", xs @ _*) => MergeStrategy.first // commons-beanutils-core-1.8.0.jar vs commons-beanutils-1.7.0.jar
    case PathList("com", "esotericsoftware", "minlog", xs @ _*) => MergeStrategy.first // kryo-2.21.jar vs minlog-1.2.jar
    case "about.html" => MergeStrategy.rename
    case x => old(x) // fall back to the default strategy
  }
}

assemblyCacheOutput in assembly := false
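For reference, the other setting mentioned in the comment can be toggled the same way. This is a sketch that assumes the assemblyCacheUnzip key as documented in the sbt-assembly README of that era; check the README for the version you are using before relying on it.

```scala
// Assumption: key name as given in the 0.10.x-era README.
// Disables caching of unzipped dependency jars between runs.
assemblyCacheUnzip in assembly := false
```

Since cacheUnzip is the optimization, turning it off is likely to make repeated runs slower, not faster, so it is probably best left at its default here.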
The assembly finished in 58 seconds after cleaning, and the second run without cleaning took 15 seconds, although some of the runs took 200+ seconds too.
Looking at the source, I could probably optimize cacheOutput, but for now, turning it off should make assembly much faster.
Edit:
I've opened issue #96, "Performance degradation when adding library dependencies", based on this question, and added some fixes in sbt-assembly 0.10.1 for sbt 0.13.
sbt-assembly 0.10.1 avoids content hashing of the unzipped items of the dependent library jars. It also skips jar caching done by sbt, since sbt-assembly is already caching the output.
The changes make the assembly task run more consistently. Using the dependency-heavy Spark as a sample project, I ran the assembly task 15 times after a small source change. sbt-assembly 0.10.0 took 19+/-157 seconds (mostly within 20 seconds, but going over 150 seconds in 26% of the runs). sbt-assembly 0.10.1, on the other hand, took 16+/-1 seconds.