Apache Spark MLlib algorithms (e.g., Decision Trees) save the model to a location (e.g., myModelPath), where two directories are created: myModelPath/data and myModelPath/metadata. There are multiple files in these paths, and they are not text files; some of them are *.parquet files.
I have a couple of questions:
Spark >= 2.4
Since Spark 2.4, format-agnostic writer interfaces are provided, and selected models already implement them. For example, LinearRegressionModel:
// Placeholders for a trained model and an output path
val lrm: org.apache.spark.ml.regression.LinearRegressionModel = ???
val path: String = ???

// Export the model in PMML format
lrm.write.format("pmml").save(path)
This will create a directory with a single file containing a PMML representation of the model.
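To inspect the result, the exported file can be read back as plain text (a sketch, assuming an active SparkSession named spark and the same path as above):

// Read the exported PMML back as text for a quick sanity check
spark.read.text(path).show(truncate = false)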
Spark < 2.4
What is the format of these files?

data/*.parquet files are in the Apache Parquet columnar storage format.
metadata/part-* files look like JSON.

Which file/files contain the actual model?

data/*.parquet
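A quick way to confirm this is to read both parts directly (a sketch, using the myModelPath example from the question and an active SparkSession named spark):

// The model parameters live in the Parquet part files
spark.read.parquet("myModelPath/data").show()

// The metadata part file is plain JSON (class name, format version, etc.)
spark.read.json("myModelPath/metadata").show(truncate = false)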
Can I save the model somewhere else, for example in a database?
I am not aware of any direct method, but you can load the model data as a DataFrame and store it in a database afterwards:
// Load the saved model data back as a DataFrame
val modelDf = spark.read.parquet("/path/to/data/")

// Write it to a database over JDBC
modelDf.write.jdbc(...)
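For reference, the jdbc call takes a URL, a table name, and connection properties; the values below are placeholders, not part of the original answer:

import java.util.Properties

// Hypothetical connection details; replace with your own
val url = "jdbc:postgresql://localhost:5432/models"
val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

// DataFrameWriter.jdbc(url, table, connectionProperties)
modelDf.write.jdbc(url, "saved_model_data", props)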