I'm evaluating tools for production ML-based applications, and one of our options is Spark MLlib, but I have some questions about how to serve a model once it's trained. For example, in Azure ML, once a model is trained it is exposed as a web service that can be consumed from any application, and Amazon ML works similarly. How do you serve/deploy ML models in Apache Spark?

How to serve a Spark MLlib model?

1 Answer

On the one hand, a machine learning model built with Spark can't be served in the traditional manner, the way you would serve one in Azure ML or Amazon ML.

Databricks claims to be able to deploy models using its notebooks, but I haven't actually tried that yet.

On the other hand, you can use a model in three ways:

Train a model on the fly inside an application, then apply it to make predictions. This can be done in a Spark application or a notebook.
Train a model and save it, provided it offers an MLWriter (that is, it implements MLWritable), then load it in an application or a notebook and run it against your data.
Train a model with Spark and export it to PMML format using jpmml-spark. PMML allows different statistical and data mining tools to speak the same language. In this way, a predictive solution can be easily moved among tools and applications without the need for custom coding, for example from Spark ML to R.

Those are the three possible ways.
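
To make the PMML option a bit more concrete, here is a minimal sketch of what "speaking the same language" can look like on the consuming side. It assumes a hand-written toy PMML document (the `PMML_DOC` string, the field names `x1`, `x2`, `y` and the coefficients are all made up for illustration, not the output of any real exporter) and reads it with nothing but the Python standard library. A real deployment would more likely rely on a proper PMML evaluator; this only shows that the exported model is plain XML that a non-Spark program can load and score.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML document written by hand for this sketch. In practice it
# would be produced by exporting a trained Spark model.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header description="toy linear regression"/>
  <DataDictionary numberOfFields="3">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="x2"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

# Namespace mapping for the PMML 4.3 schema used in the toy document.
NS = {"pmml": "http://www.dmg.org/PMML-4_3"}


def load_linear_model(pmml_text):
    """Extract the intercept and per-feature coefficients from the PMML text."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def predict(model, row):
    """Score one row (a dict of feature name -> value) with the linear model."""
    intercept, coefficients = model
    return intercept + sum(coef * row[name] for name, coef in coefficients.items())


model = load_linear_model(PMML_DOC)
print(predict(model, {"x1": 2.0, "x2": 4.0}))  # 1.5 + 2.0*2.0 - 0.5*4.0 = 3.5
```

The point of the sketch is only that the scoring side does not need Spark at all once the model lives in a portable file.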

Of course, you can think of an architecture in which the model sits behind a RESTful service, which you can build using spark-jobserver, for example, to train and deploy, but this needs some development. It's not an out-of-the-box solution.
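
As a rough illustration of the "model behind a RESTful service" idea, here is a minimal sketch using only the Python standard library. It is not the spark-jobserver API. The `/predict` endpoint, the `load_model` helper and its hard-coded weights are hypothetical stand-ins for whatever you would actually load (a saved Spark model, a PMML file, and so on). What it shows is the general shape: load the model once at startup, then answer JSON requests with predictions.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen


def load_model():
    """Hypothetical stand-in for loading a previously trained model."""
    weights = {"x1": 2.0, "x2": -0.5}  # made-up coefficients
    bias = 1.5

    def score(features):
        return bias + sum(w * float(features[name]) for name, w in weights.items())

    return score


MODEL = load_model()  # loaded once, reused for every request


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            features = json.loads(self.rfile.read(length))
            body = json.dumps({"prediction": MODEL(features)}).encode("utf-8")
        except (ValueError, KeyError, TypeError):
            # Malformed JSON or missing/invalid feature values.
            self.send_error(400, "bad request")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log requests


def start_server(port=0):
    """Start the service on localhost in a background thread (port 0 = any free port)."""
    server = HTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def request_prediction(port, features):
    """Client side: POST a feature dict as JSON and return the decoded reply."""
    req = Request(
        f"http://127.0.0.1:{port}/predict",
        data=json.dumps(features).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    server = start_server()
    print(request_prediction(server.server_address[1], {"x1": 2.0, "x2": 4.0}))
    server.shutdown()
    server.server_close()
```

Everything a production setup would need on top of this (retraining, model versioning, scaling, authentication) is exactly the development effort mentioned above.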

You might also use projects like Oryx 2 to create your full lambda architecture to train, deploy and serve a model.

Unfortunately, describing each of the solutions mentioned above is quite broad and doesn't fit in the scope of SO.

120

answered Sep 21 '22 02:09

eliasah

Tags:

machine-learning

apache-spark

apache-spark-mllib

Luis Leal
