
How to load a PMML model?

I'm following the instructions of PMML model export - spark.mllib to create a K-means model.

val numClusters = 10
val numIterations = 10
val clusters = KMeans.train(data, numClusters, numIterations)
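// Export to PMML: print the XML, then write it to a local file.
// toPMML(path) returns Unit, so its result is not worth printing.
println("PMML Model:\n" + clusters.toPMML)
clusters.toPMML("/kmeans.xml")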

But I don't know how to load the PMML after that.

I tried:

val sameModel = KMeansModel.load(sc, "/kmeans.xml")

but it fails with:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/kmeans.xml/metadata

Any idea?

Best regards

P. Barbadew asked Jun 15 '16 14:06


2 Answers

As stated in the documentation (for the version you seem to be interested in, 1.6.1, and also for the latest available, 2.1.0), Spark supports only exporting to PMML. The load method expects a model saved in Spark's own format, which is why it looks for a specific path (here, a metadata subdirectory) and why the exception was thrown.

If you trained the model with Spark, you can save it and load it later.
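A minimal sketch of that save/load round trip, assuming the `sc` and `data` values from the question (for example in spark-shell) and a hypothetical output path `/kmeans-model` that does not exist yet:

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}

// `sc` (SparkContext) and `data` (RDD[Vector]) are assumed to exist already,
// as in the question.
val clusters = KMeans.train(data, 10, 10)

// Hypothetical path. save() writes a directory (with metadata/ and data/
// inside), not a single file, and fails if the directory already exists.
val modelPath = "/kmeans-model"
clusters.save(sc, modelPath)

// load() reads that directory back; this is the layout it was looking for
// when it complained about /kmeans.xml/metadata.
val sameModel = KMeansModel.load(sc, modelPath)
println(sameModel.clusterCenters.mkString("\n"))
```

The PMML export and the native save are independent, so you can call both `toPMML` and `save` on the same model if you need both files.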

If you need to load a model that was not trained in Spark and was saved as PMML, you can use jpmml-spark to load and evaluate it.

stefanobaghino answered Sep 29 '22 23:09


From my limited experience with spark.mllib's KMeans, loading a PMML file is not possible out of the box, but you could develop the feature yourself.

spark.mllib's KMeansModel is PMMLExportable:

class KMeansModel @Since("1.1.0") (@Since("1.0.0") val clusterCenters: Array[Vector])
  extends Saveable with Serializable with PMMLExportable {

That's why you can use toPMML, which saves a model in the PMML XML format.

(Again, I have very little experience with Spark MLlib.) My understanding is that a KMeans model is all about its centroids, and that is what gets loaded when you call KMeansModel.load. It in turn uses KMeansModel.SaveLoadV1_0.load, which reads the centroids and creates a KMeansModel:

new KMeansModel(localCentroids.sortBy(_.id).map(_.point))

For KMeansModel.toPMML, Spark MLlib uses the PMML class from pmml-model:

new PMML("4.2", header, null)

I'd recommend exploring how pmml-model's PMML class handles saving and loading, as that is outside Spark's realm.
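If you only need the centroids back, one rough way to "develop the feature yourself" is to read them straight out of the exported XML. The sketch below is an illustration, not Spark's or pmml-model's API: `PmmlCenters` is a made-up helper, it uses only the JDK's XML parser, and it assumes the file has the layout the k-means export produces, with one `<Cluster>` element per centroid holding a space-separated `<Array>` of coordinates.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory
import org.w3c.dom.Element

// Hypothetical helper: pulls cluster centers out of a PMML document.
object PmmlCenters {
  def parse(pmmlXml: String): Array[Array[Double]] = {
    val doc = DocumentBuilderFactory.newInstance()
      .newDocumentBuilder()
      .parse(new ByteArrayInputStream(pmmlXml.getBytes("UTF-8")))

    // Clusters come back in document order (cluster_0, cluster_1, ...).
    val clusters = doc.getElementsByTagName("Cluster")
    (0 until clusters.getLength).map { i =>
      val cluster = clusters.item(i).asInstanceOf[Element]
      // Assumes each <Cluster> has one <Array> child with the coordinates.
      val coords = cluster.getElementsByTagName("Array").item(0).getTextContent
      coords.trim.split("\\s+").map(_.toDouble)
    }.toArray
  }
}
```

With Spark on the classpath, the result can be turned back into a model with `new KMeansModel(PmmlCenters.parse(xml).map(Vectors.dense))`, which mirrors what KMeansModel.load does with the centroids it reads. Anything else stored in the PMML file is ignored.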


Side notes

Why would you even want Spark to host the model after you have trained it? It is certainly possible, but you may be wasting cluster resources just to have Spark serve the model.

In my limited understanding, the sole purpose of Spark MLlib is to use Spark's features, like distribution and parallelism, to build models from large datasets, and then to use those models without the Spark machinery afterwards.

I must be missing something important in my narrow view...

Jacek Laskowski answered Sep 30 '22 00:09