
How to save models from ML Pipeline to S3 or HDFS?

I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:

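import java.io._
import org.apache.spark.ml.PipelineModel

// Serialize one fitted pipeline to a file on the local filesystem
def saveModel(name: String, model: PipelineModel): Unit = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  try oos.writeObject(model) finally oos.close()
}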

schools.zip(bySchoolArrayModels).foreach {
  case (name, model) => saveModel(name, model)
}

I have tried using s3://some/path/$name and /user/hadoop/some/path/$name, since I would eventually like the models to be saved to Amazon S3, but both fail with messages indicating that the path cannot be found.

How to save models to Amazon S3?

SH Y. asked Aug 30 '15 01:08



2 Answers

One way to save a model to HDFS is as follows:

// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/linReg.model")

The saved model can then be loaded from the same path:

val linRegModel = sc.objectFile[LinearRegressionModel]("hdfs:///user/root/linReg.model").first()

For more details see (ref)
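The `saveModel` helper from the question fails on `s3://` and HDFS paths because `FileOutputStream` only writes to the local filesystem. A minimal sketch of one alternative, assuming the Hadoop client libraries are on the classpath and a connector for the target URI scheme is configured: open the stream through Hadoop's `FileSystem` API instead. The names `saveObject` and `loadObject` are hypothetical helpers, not part of Spark or Hadoop.

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helpers (not from the original answers): plain Java
// serialization written through a Hadoop FileSystem stream rather than
// a local FileOutputStream.
def saveObject(uri: String, obj: java.io.Serializable, conf: Configuration): Unit = {
  val path = new Path(uri)
  val fs = path.getFileSystem(conf) // filesystem is chosen from the URI scheme
  val oos = new ObjectOutputStream(fs.create(path, true)) // true = overwrite
  try oos.writeObject(obj) finally oos.close()
}

def loadObject[T](uri: String, conf: Configuration): T = {
  val path = new Path(uri)
  val ois = new ObjectInputStream(path.getFileSystem(conf).open(path))
  try ois.readObject().asInstanceOf[T] finally ois.close()
}
```

Inside a Spark application, `sc.hadoopConfiguration` can be passed as `conf`, so the helper would pick up whatever filesystem settings the cluster already uses.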

Neil answered Sep 27 '22 23:09


Since Apache Spark 1.6, the Scala API lets you save your models without using any tricks. Models from the ML library come with a save method; you can check this on LogisticRegressionModel, which does have it. To load a model, use the static load method:

val logRegModel = LogisticRegressionModel.load("myModel.model")
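Applied to the question's case of many fitted pipelines, the same save/load API could be used roughly as follows. This is a sketch under assumptions: `saveAll`, `loadOne`, and `baseUri` are hypothetical names, every stage in each pipeline must support persistence, and an active SparkContext is required. Whether `baseUri` can be an `hdfs://` or S3 location depends on the Hadoop connectors configured for the cluster.

```scala
import org.apache.spark.ml.PipelineModel

// Hypothetical helper: write each named PipelineModel under baseUri/<name>.
// Each model is saved as a directory, not a single file.
def saveAll(models: Seq[(String, PipelineModel)], baseUri: String): Unit =
  models.foreach { case (name, model) =>
    model.write.overwrite().save(s"$baseUri/$name")
  }

// Hypothetical helper: read one model back by name.
def loadOne(baseUri: String, name: String): PipelineModel =
  PipelineModel.load(s"$baseUri/$name")
```

With the question's variables, a call might look like `saveAll(schools.zip(bySchoolArrayModels), "hdfs:///some/path")`, where the path is a placeholder.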

Alberto Bonsanto answered Sep 28 '22 00:09