I have code which essentially looks like this:
    class FoodTrainer(images: S3Path) { // data is >100GB file living in S3
      def train(): FoodClassifier // Very expensive - takes ~5 hours!
    }

    class FoodClassifier { // Light-weight API class
      def isHotDog(input: Image): Boolean
    }
At JAR-assembly (sbt assembly) time, I want to invoke val classifier = new FoodTrainer(s3Dir).train() and publish a JAR which has the classifier instance instantly available to downstream library users.
What is the easiest way to do this? What are some established paradigms for this? I know it's a fairly common idiom in ML projects to publish trained models, e.g. http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar
How do I do this using sbt assembly without having to check a large model class or data file into my version control?
To build a JAR file with your application when you have no external dependencies, you can run sbt package, and it will build a hello-world_2.11_1.0.jar file with your code so you can run it with java -jar hello-world.
All new sbt versions (after 0.7.x) by default put the downloaded JARs into the .ivy2 directory in your home directory. If you are using Linux, this is usually /home/<username>/.
You should serialize the data which results from training into its own file. You can then package this data file in your JAR. Your production code opens the file and reads it rather than running the training algorithm.
The steps are as follows.
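During the resource generation phase of the build, train the model and write its serialized form to a file under the managed resources directory:

    resourceGenerators in Compile += Def.task {
      // Train once at build time and serialize the result.
      val classifier = new FoodTrainer(s3Dir).train()
      val contents = FoodClassifier.serialize(classifier)
      val file = (resourceManaged in Compile).value / "mypackage" / "food-classifier.model"
      IO.write(file, contents)
      Seq(file)
    }.taskValue

The generated file is included in the jar file automatically and it won't appear in the source tree. Production code then reads the resource and parses the model:

    object FoodClassifierModel {
      lazy val classifier = readResource("/mypackage/food-classifier.model")

      def readResource(resourceName: String): FoodClassifier = {
        val stream = getClass.getResourceAsStream(resourceName)
        try {
          val lines = scala.io.Source.fromInputStream(stream).getLines()
          val contents = lines.mkString("\n")
          FoodClassifier.parse(contents)
        } finally stream.close() // release the resource stream once parsed
      }
    }

    object FoodClassifier {
      def parse(content: String): FoodClassifier = ??? // deserialization left to the implementer
      def serialize(classifier: FoodClassifier): String = ??? // serialization left to the implementer
    }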
Of course, as your data is rather big, you'll need to use streaming serializers and parsers to avoid exhausting the Java heap. The above just shows how to package a resource at build time.
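As a rough illustration of the streaming idea, here is a minimal sketch that writes and reads a model value by value through buffered streams instead of building one large String. The Weights class and WeightsCodec object are hypothetical stand-ins invented for this example (the real FoodClassifier format is not shown in the question), and the binary layout is an assumption.

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream, InputStream, OutputStream}

// Hypothetical stand-in for the trained model: just a vector of weights.
final case class Weights(values: Array[Double])

object WeightsCodec {
  // Writes the element count, then each weight, through a buffered stream.
  def write(model: Weights, out: OutputStream): Unit = {
    val data = new DataOutputStream(new BufferedOutputStream(out))
    try {
      data.writeInt(model.values.length)
      model.values.foreach(data.writeDouble)
    } finally data.close() // flushes the buffer and closes the underlying stream
  }

  // Reads the count, then fills the array one value at a time;
  // no intermediate String of the whole file is ever created.
  def read(in: InputStream): Weights = {
    val data = new DataInputStream(new BufferedInputStream(in))
    try {
      val n = data.readInt()
      Weights(Array.fill(n)(data.readDouble()))
    } finally data.close()
  }
}
```

In production code, something like WeightsCodec.read(getClass.getResourceAsStream("/mypackage/food-classifier.model")) would replace the String-based readResource, and the build task would write through a java.io.FileOutputStream rather than IO.write. The decoded model itself still has to fit in memory; only the intermediate text copy is avoided.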
See http://www.scala-sbt.org/1.x/docs/Howto-Generating-Files.html
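Since training takes about five hours, it may be worth skipping it when a model file has already been generated. The following build.sbt fragment is only a sketch of that idea: it assumes FoodTrainer, FoodClassifier.serialize and s3Dir are visible to the build definition (for example, defined under project/), which the question does not confirm, and it uses a plain file-existence check rather than any sbt caching facility.

```scala
// build.sbt fragment (sketch, not a tested build).
// Assumption: FoodTrainer, FoodClassifier and s3Dir are on the build classpath.
resourceGenerators in Compile += Def.task {
  val file = (resourceManaged in Compile).value / "mypackage" / "food-classifier.model"
  // Only train when no model file has been generated yet.
  if (!file.exists()) {
    val classifier = new FoodTrainer(s3Dir).train()
    IO.write(file, FoodClassifier.serialize(classifier))
  }
  Seq(file)
}.taskValue
```

With this check the model is not refreshed when the training data changes. With default settings the managed resources directory lives under target, so running sbt clean should remove the file and force retraining on the next build.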