Apache Spark MLlib algorithms (e.g., Decision Trees) save the model to a location (e.g., myModelPath), where two directories are created: myModelPath/data and myModelPath/metadata. There are multiple files in these paths, and they are not text files; some of them are *.parquet files.
I have a couple of questions:
Spark >= 2.4
Since Spark 2.4, format-agnostic writer interfaces are provided, and selected models already implement them. For example, LinearRegressionModel:
// Placeholders for a trained model and an output path
val lrm: org.apache.spark.ml.regression.LinearRegressionModel = ???
val path: String = ???

// Export the model in PMML format
lrm.write.format("pmml").save(path)
This will create a directory with a single file containing a PMML representation of the model.
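To inspect the result, the exported file can be read back as plain text (a sketch, assuming an active SparkSession named spark and the same path as above):

// Read the exported PMML back as text for a quick sanity check
spark.read.text(path).show(truncate = false)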
Spark < 2.4
What is the format of these files?

data/*.parquet files are in the Apache Parquet columnar storage format.
metadata/part-* files look like JSON.

Which file/files contain the actual model?

data/*.parquet
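A quick way to confirm this is to read both parts directly (a sketch, using the myModelPath example from the question and an active SparkSession named spark):

// The model parameters live in the Parquet part files
spark.read.parquet("myModelPath/data").show()

// The metadata part file is plain JSON (class name, format version, etc.)
spark.read.json("myModelPath/metadata").show(truncate = false)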
Can I save the model somewhere else, for example in a database?
I am not aware of any direct method, but you can load the model data as a DataFrame and store it in a database afterwards:
// Load the saved model data back as a DataFrame
val modelDf = spark.read.parquet("/path/to/data/")

// Write it to a database over JDBC
modelDf.write.jdbc(...)
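For reference, the jdbc call takes a URL, a table name, and connection properties; the values below are placeholders, not part of the original answer:

import java.util.Properties

// Hypothetical connection details; replace with your own
val url = "jdbc:postgresql://localhost:5432/models"
val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

// DataFrameWriter.jdbc(url, table, connectionProperties)
modelDf.write.jdbc(url, "saved_model_data", props)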