I am interested in deploying a machine learning model in Python, so that predictions can be made through requests to a server.
I will create a Cloudera cluster and take advantage of Spark to develop the models, using the pyspark library. I would like to know how the model can be saved in order to use it on the server.
I have seen that the different algorithms have .save functions (as answered in this post: How to save and load MLLib model in Apache Spark), but since the server will be on a different machine without Spark, outside the Cloudera cluster, I don't know whether it is possible to use their .load and .predict functions there.
Can prediction be done with the pyspark library functions without Spark running underneath? Or would I have to perform some transformation in order to save the model and use it elsewhere?
After spending an hour I got this working code. It may not be optimized, but it works.
Mymodel.py:
import os
import sys

# Path for the Spark installation folder
os.environ['SPARK_HOME'] = "E:\\Work\\spark\\installtion\\spark"

# Append pyspark to the Python path
sys.path.append("E:\\Work\\spark\\installtion\\spark\\python")

try:
    from numpy import array
    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.clustering import KMeans, KMeansModel
    print("Successfully imported Spark modules")
except ImportError as e:
    print("Cannot import Spark modules:", e)
    sys.exit(1)

if __name__ == "__main__":
    sconf = SparkConf().setAppName("KMeansExample").set(
        'spark.sql.warehouse.dir',
        'file:///E:/Work/spark/installtion/spark/spark-warehouse/')
    sc = SparkContext(conf=sconf)

    # Four 2-dimensional points to cluster
    parsedData = array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0]).reshape(4, 2)

    # Train a KMeans model with 2 clusters
    clusters = KMeans.train(sc.parallelize(parsedData), 2, maxIterations=10,
                            runs=10, initializationMode="random")

    clusters.save(sc, "mymodel")  # this will save the model to the file system
    sc.stop()
This code will train a KMeans clustering model and save it to the file system.
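Before wiring this into a server, you can sanity-check the saved model by loading it back in a small local script (a minimal sketch; the file name verify_model.py and the "VerifyKMeans" app name are placeholders I chose, and it still needs the same local Spark installation):
import os
import sys

os.environ['SPARK_HOME'] = "E:\\Work\\spark\\installtion\\spark"
sys.path.append("E:\\Work\\spark\\installtion\\spark\\python")

from numpy import array
from pyspark import SparkConf, SparkContext
from pyspark.mllib.clustering import KMeansModel

if __name__ == "__main__":
    sc = SparkContext(conf=SparkConf().setAppName("VerifyKMeans"))
    model = KMeansModel.load(sc, "mymodel")    # same path used in clusters.save above
    print(model.clusterCenters)                # the two learned cluster centres
    print(model.predict(array([0.5, 0.5])))    # should fall in the cluster near the origin
    sc.stop()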
API.py:
from flask import jsonify, Flask
import os
import sys

# Path for the Spark installation folder
os.environ['SPARK_HOME'] = "E:\\Work\\spark\\installtion\\spark"

# Append pyspark to the Python path
sys.path.append("E:\\Work\\spark\\installtion\\spark\\python")

try:
    from numpy import array
    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.clustering import KMeansModel
    print("Successfully imported Spark modules")
except ImportError as e:
    print("Cannot import Spark modules:", e)
    sys.exit(1)

app = Flask(__name__)

@app.route('/', methods=['GET'])
def predict():
    # A new SparkContext is created per request; only one can be active at a
    # time, so it is stopped again before the response is returned.
    sconf = SparkConf().setAppName("KMeansExample").set(
        'spark.sql.warehouse.dir',
        'file:///E:/Work/spark/installtion/spark/spark-warehouse/')
    sc = SparkContext(conf=sconf)
    sameModel = KMeansModel.load(sc, "mymodel")      # load the model saved by Mymodel.py
    response = sameModel.predict(array([0.0, 0.0]))  # pass your data point here
    sc.stop()
    return jsonify(int(response))

if __name__ == '__main__':
    app.run()
Above is my REST API written in Flask.
Make a GET call to http://127.0.0.1:5000/ and you can see the response in the browser.
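For reference, this is one way to call the endpoint from another Python process instead of the browser (a small sketch using the requests library, an extra dependency not mentioned above):
import requests

# Call the Flask endpoint started by API.py (assumes the default port 5000)
r = requests.get("http://127.0.0.1:5000/")
print(r.status_code)   # 200 if the prediction succeeded
print(r.json())        # the cluster index returned by jsonify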
Take a look at MLeap (a project I contribute to) - it provides serialization/de-serialization of entire ML pipelines (not just the estimator) and an execution engine that doesn't rely on the Spark context, distributed data frames, or execution plans.
As of today, MLeap's runtime for executing models doesn't have Python bindings, only Scala/Java, but it shouldn't be complicated to add them. Feel free to reach out on GitHub to me and the other MLeap developers if you need help creating a scoring engine from your Spark-trained pipelines and models.
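To illustrate the same idea of Spark-free scoring without any extra library: for a model as simple as KMeans you can also export the cluster centres yourself and score with plain NumPy on the server. This is a hedged sketch of that workaround, not MLeap's API; the file name centers.npy is just a placeholder.
# On the Spark side, after training as in Mymodel.py:
#     import numpy as np
#     np.save("centers.npy", np.array(clusters.clusterCenters))

# On the server, with only NumPy installed:
import numpy as np

centers = np.load("centers.npy")   # shape (k, n_features)

def predict(point):
    # Assign the point to the nearest cluster centre,
    # which is what KMeansModel.predict does.
    point = np.asarray(point)
    return int(np.argmin(((centers - point) ** 2).sum(axis=1)))

print(predict([0.0, 0.0]))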