
SBT: How to package an instance of a class as a JAR?

I have code which essentially looks like this:

class FoodTrainer(images: S3Path) {   // data is a >100GB file living in S3
  def train(): FoodClassifier = ???   // Very expensive - takes ~5 hours!
}

class FoodClassifier {                // Light-weight API class
  def isHotDog(input: Image): Boolean = ???
}

At JAR-assembly time (sbt assembly), I want to invoke val classifier = new FoodTrainer(s3Dir).train() and publish a JAR in which the classifier instance is instantly available to downstream library users.

What is the easiest way to do this? What are some established paradigms for it? I know it's a fairly common idiom in ML projects to publish trained models, e.g. http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar

How do I do this with sbt assembly without having to check a large model class or data file into version control?

pathikrit asked Nov 08 '17 16:11



2 Answers

You should serialize the data that results from training into its own file. You can then package this data file in your JAR. Your production code opens the file and reads it rather than running the training algorithm.
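A minimal sketch of that idea, assuming the trained model can be made java.io.Serializable. The FoodClassifier shown here is a hypothetical stand-in with a single weights field (the real class's fields are not given in the question), and ModelFiles is an illustrative helper name, not part of any library.

```scala
import java.io._

// Hypothetical stand-in: the question does not show what FoodClassifier holds.
@SerialVersionUID(1L)
final class FoodClassifier(val weights: Array[Double]) extends Serializable

object ModelFiles {
  // Training side: write the trained instance to a file that gets packaged in the JAR.
  def save(model: FoodClassifier, file: File): Unit = {
    val out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)))
    try out.writeObject(model) finally out.close()
  }

  // Production side: read the instance back from any stream (a file or a classpath resource).
  def load(in: InputStream): FoodClassifier = {
    val ois = new ObjectInputStream(new BufferedInputStream(in))
    try ois.readObject().asInstanceOf[FoodClassifier] finally ois.close()
  }
}
```

Downstream code would call something like ModelFiles.load(getClass.getResourceAsStream(...)) once at startup. Java serialization is only one option; it ties the file format to the class layout, so a versioned custom format may age better.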

Code-Apprentice answered Oct 17 '22 04:10


The steps are as follows.

  1. Generate the model during the resource generation phase of the build.
  2. Serialize the contents of the model to a file in a managed resources folder.
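    // build.sbt: FoodTrainer, FoodClassifier and s3Dir must be visible to the
    // build definition (for example, defined under project/).
    resourceGenerators in Compile += Def.task {
      val classifier = new FoodTrainer(s3Dir).train()
      val contents = FoodClassifier.serialize(classifier)
      // resourceManaged lives under target/, so the file stays out of the source tree.
      val file = (resourceManaged in Compile).value / "mypackage" / "food-classifier.model"
      IO.write(file, contents)
      Seq(file) // files returned here are packaged as resources
    }.taskValue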
    
  3. The resource is included in the JAR file automatically and does not appear in the source tree.
  4. To load the model, add code that reads the resource and parses the model.
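    object FoodClassifierModel {
      // Loaded once, on first access, from the resource bundled in the JAR.
      lazy val classifier: FoodClassifier = readResource("/mypackage/food-classifier.model")

      def readResource(resourceName: String): FoodClassifier = {
        val stream = getClass.getResourceAsStream(resourceName)
        require(stream != null, s"Resource not found: $resourceName")
        val source = scala.io.Source.fromInputStream(stream, "UTF-8")
        try FoodClassifier.parse(source.mkString)
        finally source.close() // also closes the underlying stream
      }
    }
    object FoodClassifier {
      // The serialization format is up to you; these are placeholders.
      def parse(content: String): FoodClassifier = ???
      def serialize(classifier: FoodClassifier): String = ???
    }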
    

Of course, since your data is rather big, you'll need to use streaming serializers and parsers so that you do not exhaust the Java heap. The code above only shows how to package a resource at build time.
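As a rough illustration of the streaming point, the sketch below writes and reads values one at a time through java.io data streams instead of building one large String. It assumes, purely for illustration, that the model reduces to a sequence of Double weights; StreamingModelIO and its format (an Int count followed by the values) are hypothetical.

```scala
import java.io._

object StreamingModelIO {
  // Writes a count header, then each value in turn; no intermediate String is built.
  // Assumes the iterator yields at least `count` elements.
  def write(weights: Iterator[Double], count: Int, out: OutputStream): Unit = {
    val data = new DataOutputStream(new BufferedOutputStream(out))
    try {
      data.writeInt(count)
      weights.take(count).foreach(data.writeDouble)
    } finally data.close()
  }

  // Reads the count header, then pulls the values from the stream one by one.
  def read(in: InputStream): Array[Double] = {
    val data = new DataInputStream(new BufferedInputStream(in))
    try {
      val count = data.readInt()
      Array.fill(count)(data.readDouble())
    } finally data.close()
  }
}
```

The loaded model still has to fit in memory; what this avoids is holding both the serialized text and the parsed model on the heap at the same time.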

See http://www.scala-sbt.org/1.x/docs/Howto-Generating-Files.html
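One practical concern with the generator above: as far as I know, sbt does not cache resource generators by default, so the roughly 5-hour training could rerun on builds that need resources. A crude sketch of a guard is shown below; it reuses the names from the answer (FoodTrainer, FoodClassifier.serialize, s3Dir), which are assumed to be visible to the build definition, and simply skips training when the generated file already exists.

```scala
// build.sbt fragment (sketch, not a tested build)
resourceGenerators in Compile += Def.task {
  val file = (resourceManaged in Compile).value / "mypackage" / "food-classifier.model"
  if (!file.exists()) {
    // Only pay the training cost when no generated model is present.
    val classifier = new FoodTrainer(s3Dir).train()
    IO.write(file, FoodClassifier.serialize(classifier))
  }
  Seq(file)
}.taskValue
```

Because the managed resources folder sits under target/, running sbt clean removes the file and triggers retraining. The page linked above describes sbt's own file-caching helpers, which are likely a better fit than this existence check.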

Arseniy Zhizhelev answered Oct 17 '22 04:10