Apache Spark MLlib Model File Format

Apache Spark MLlib algorithms (e.g., Decision Trees) save a model to a location (e.g., myModelPath) by creating two directories there: myModelPath/data and myModelPath/metadata. Each directory contains multiple files, and they are not plain text files. Some of them have the *.parquet extension.

I have a couple of questions:

  • What is the format of these files?
  • Which file or files contain the actual model?
  • Can I save the model to somewhere else, for example in a DB?
Soumya Kanti asked Aug 12 '15 18:08


1 Answer

Spark >= 2.4

Spark 2.4 and later provide format-agnostic writer interfaces, and selected models already implement them. For example, with LinearRegressionModel:

val lrm: org.apache.spark.ml.regression.LinearRegressionModel = ???
val path: String = ???

lrm.write.format("pmml").save(path)

will create a directory with a single file containing the PMML representation of the model.
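For anyone who wants to try the PMML export end to end, here is a minimal sketch. The toy training data, the local master, and the output path /tmp/lr-model-pmml are all assumptions made for illustration, and it presumes spark-mllib 2.4 or later is on the classpath.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object PmmlExportSketch {
  // Hypothetical output location; change as needed.
  val path = "/tmp/lr-model-pmml"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("pmml-export-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy training data (made up): the label is roughly 2 * x.
    val training = Seq(
      (2.0, Vectors.dense(1.0)),
      (4.1, Vectors.dense(2.0)),
      (5.9, Vectors.dense(3.0))
    ).toDF("label", "features")

    val lrm = new LinearRegression().fit(training)

    // Same call as above, with overwrite() so the sketch can be re-run.
    lrm.write.overwrite().format("pmml").save(path)

    // The output is plain text, so the first lines can be printed directly.
    spark.read.textFile(path).take(5).foreach(println)

    spark.stop()
  }
}
```

Only models that implement the general writer support format("pmml"), so this may not carry over to other model types.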

Spark < 2.4

What is the format of these files?

  • data/*.parquet files are in the Apache Parquet columnar storage format
  • metadata/part-* files look like JSON
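Both formats can be inspected with the regular DataFrame readers. The sketch below is only an illustration: it first saves a tiny spark.ml DecisionTreeRegressionModel so there is something to look at, and the toy data and the /tmp/myModelPath location are assumptions. For a model you already have, only the two read calls matter.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.sql.SparkSession

object InspectModelFiles {
  // Hypothetical model location, standing in for myModelPath.
  val modelPath = "/tmp/myModelPath"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("inspect-model-files")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data (made up) so that a model exists on disk to inspect.
    val training = Seq(
      (1.0, Vectors.dense(0.0)),
      (1.0, Vectors.dense(1.0)),
      (5.0, Vectors.dense(8.0)),
      (5.0, Vectors.dense(9.0))
    ).toDF("label", "features")
    new DecisionTreeRegressor().fit(training).write.overwrite().save(modelPath)

    // metadata/ holds JSON text: class name, Spark version, params, and so on.
    val metadata = spark.read.json(s"$modelPath/metadata")
    metadata.printSchema()
    metadata.show(truncate = false)

    // data/ holds Parquet files; the schema depends on the model type.
    val data = spark.read.parquet(s"$modelPath/data")
    data.printSchema()

    spark.stop()
  }
}
```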

Which file or files contain the actual model?

  • data/*.parquet

Can I save the model to somewhere else, for example in a DB?

I am not aware of any direct method, but you can load the model data as a DataFrame and then store it in a database:

val modelDf = spark.read.parquet("/path/to/data/")
modelDf.write.jdbc(...)
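A fuller version of that idea might look like the sketch below. Everything about the connection is an assumption: the PostgreSQL URL, the credentials, the model_data table name, and the presence of a matching JDBC driver on the classpath. Model data often contains nested or vector columns that a JDBC sink may not accept as-is, so this sketch collapses each row into one JSON string first. That is one possible workaround, not something Spark requires.

```scala
import java.util.Properties

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct, to_json}

object ModelToJdbcSketch {
  // Collapse every row into a single JSON string column named "payload".
  def toJsonPayload(df: DataFrame): DataFrame =
    df.select(to_json(struct(df.columns.map(col): _*)).as("payload"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("model-to-jdbc-sketch")
      .master("local[*]")
      .getOrCreate()

    val modelDf = spark.read.parquet("/path/to/data/")

    // Hypothetical connection details; replace with your own.
    val url = "jdbc:postgresql://localhost:5432/models"
    val props = new Properties()
    props.setProperty("user", "spark")
    props.setProperty("password", "secret")
    props.setProperty("driver", "org.postgresql.Driver")

    // Append the rows to a (hypothetical) table; Spark creates it if missing.
    toJsonPayload(modelDf).write.mode("append").jdbc(url, "model_data", props)

    spark.stop()
  }
}
```

Keep in mind that the rows stored this way cover only the data directory. To load the model back through Spark's own readers you would still need the metadata files alongside it.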
zero323 answered Oct 09 '22 01:10