SBT: How to package an instance of a class as a JAR?

Tags: java, jar, scala, sbt, sbt-assembly

I have code which essentially looks like this:

class FoodTrainer(images: S3Path) { // data is >100GB file living in S3
  def train(): FoodClassifier       // Very expensive - takes ~5 hours!
}

class FoodClassifier {          // Light-weight API class
  def isHotDog(input: Image): Boolean
}

At JAR-assembly time (sbt assembly), I want to invoke val classifier = new FoodTrainer(s3Dir).train() and publish a JAR in which the classifier instance is instantly available to downstream library users.

What is the easiest way to do this? What are some established paradigms for this? I know it's a fairly common idiom in ML projects to publish trained models, e.g. http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar

How do I do this using sbt assembly without having to check a large model class or data file into version control?

482

asked Nov 08 '17 16:11

pathikrit

2 Answers

You should serialize the data that results from training into its own file. You can then package this data file in your JAR. Your production code opens the file and reads it rather than running the training algorithm.
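As a minimal sketch of that idea, the block below saves a model to a file with plain Java serialization and reads it back from an input stream. `TinyClassifier` and `ModelIO` are hypothetical stand-ins for illustration and are not part of the question's code; any `Serializable` model class would work the same way.

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, File, FileOutputStream,
  InputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a trained model. Case classes are Serializable.
final case class TinyClassifier(weights: Vector[Double], threshold: Double) {
  // Toy decision rule: weighted sum of the features compared to a threshold.
  def isHotDog(features: Vector[Double]): Boolean =
    weights.zip(features).map { case (w, x) => w * x }.sum > threshold
}

object ModelIO {
  // Build side: write the trained model to a file that gets packaged in the JAR.
  def save(model: TinyClassifier, file: File): Unit = {
    val out = new ObjectOutputStream(new BufferedOutputStream(new FileOutputStream(file)))
    try out.writeObject(model) finally out.close()
  }

  // Runtime side: read the model back from any stream, e.g. a classpath resource.
  def load(in: InputStream): TinyClassifier = {
    val ois = new ObjectInputStream(new BufferedInputStream(in))
    try ois.readObject().asInstanceOf[TinyClassifier] finally ois.close()
  }
}
```

Downstream code could then call something like `ModelIO.load(getClass.getResourceAsStream("/mypackage/food-classifier.model"))`, assuming the file was packaged under that resource path.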

180

answered Oct 17 '22 04:10

Code-Apprentice

The steps are as follows.

1. Generate the model during the resource generation phase of the build.

2. Serialize the contents of the model to a file in a managed resources folder.

// build.sbt: FoodTrainer and FoodClassifier must be visible to the build definition
resourceGenerators in Compile += Def.task {
  val classifier = new FoodTrainer(s3Dir).train()       // the expensive step
  val contents = FoodClassifier.serialize(classifier)   // model as a String
  val file = (resourceManaged in Compile).value / "mypackage" / "food-classifier.model"
  IO.write(file, contents)
  Seq(file)                                             // files to add to the resources
}.taskValue

3. The resource will be included in the JAR file automatically, and it won't appear in the source tree.

4. To load the model, add code that reads the resource and parses the model.

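object FoodClassifierModel {
  // Loaded once, on first access, from the resource packaged in the JAR.
  lazy val classifier: FoodClassifier = readResource("/mypackage/food-classifier.model")

  def readResource(resourceName: String): FoodClassifier = {
    val stream = getClass.getResourceAsStream(resourceName)
    require(stream != null, s"Resource not found: $resourceName")
    val source = scala.io.Source.fromInputStream(stream, "UTF-8")
    try FoodClassifier.parse(source.getLines().mkString("\n"))
    finally source.close()
  }
}

object FoodClassifier {
  // Placeholders: implement these to match whatever model format you choose.
  def parse(content: String): FoodClassifier = ???
  def serialize(classifier: FoodClassifier): String = ???
}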

Of course, since your data is rather large, you'll need to use streaming serializers and parsers to avoid exhausting the Java heap. The above only shows how to package a resource at build time.
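To illustrate what "streaming" could look like here, the sketch below writes model weights one value at a time to a binary stream and reads them back, so the model never has to exist as one large String. The format (an Int count followed by raw doubles) and the name `StreamingModelCodec` are assumptions made for this example, not something the answer prescribes.

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream,
  DataOutputStream, InputStream, OutputStream}

object StreamingModelCodec {
  // Writes a count header, then each weight in turn.
  // The caller must ensure the iterator yields at least `count` values.
  def write(weights: Iterator[Double], count: Int, out: OutputStream): Unit = {
    val data = new DataOutputStream(new BufferedOutputStream(out))
    try {
      data.writeInt(count)
      weights.take(count).foreach(w => data.writeDouble(w))
    } finally data.close()
  }

  // Reads the header, then fills the array directly from the stream.
  def read(in: InputStream): Array[Double] = {
    val data = new DataInputStream(new BufferedInputStream(in))
    try {
      val n = data.readInt()
      Array.fill(n)(data.readDouble())
    } finally data.close()
  }
}
```

On the build side, `write` could target a `FileOutputStream` for the managed resource file; at runtime, `read` could take the stream returned by `getClass.getResourceAsStream`.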

See http://www.scala-sbt.org/1.x/docs/Howto-Generating-Files.html
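One practical concern with the resource generator above is that its body runs whenever sbt needs the managed resources, which could retrain the model on every packaging run. A possible variant, written in the sbt 1.x slash syntax, skips training when the model file already exists. This is only a sketch: it assumes `FoodTrainer`, `FoodClassifier`, and `s3Dir` from the question are visible to the build definition (for example, defined under `project/`).

```scala
// build.sbt
Compile / resourceGenerators += Def.task {
  val file = (Compile / resourceManaged).value / "mypackage" / "food-classifier.model"
  if (!file.exists()) {
    // Pay the training cost only when no model has been generated yet.
    val classifier = new FoodTrainer(s3Dir).train()
    IO.write(file, FoodClassifier.serialize(classifier))
  }
  Seq(file)
}.taskValue
```

Since the managed resources folder normally lives under `target`, running `sbt clean` should delete the file and trigger retraining on the next build.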

answered Oct 17 '22 04:10

Arseniy Zhizhelev
