
Loading a pyspark ML model in a non-Spark environment

I am interested in deploying a machine learning model in Python, so that predictions can be made through requests to a server.

I will create a Cloudera cluster and take advantage of Spark to develop the models, using the pyspark library. I would like to know how a model can be saved in order to use it on the server.

I have seen that the different algorithms have .save functions (as answered in this post: How to save and load MLLib model in Apache Spark), but since the server will be a different machine without Spark, outside the Cloudera cluster, I don't know whether it is possible to use their .load and .predict functions there.

Can the pyspark library functions be used for prediction without Spark underneath? Or would I have to transform the model somehow in order to save it and use it elsewhere?
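
For reference, the save/load pattern from that post looks like this in pyspark (a minimal sketch; the path is hypothetical, and note that both load and predict require a live SparkContext, which is exactly what the server lacks):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeansModel

sc = SparkContext(appName="load-example")
model = KMeansModel.load(sc, "hdfs:///models/kmeans")  # hypothetical path
print(model.predict([0.0, 0.0]))  # returns the index of the nearest cluster
sc.stop()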

asked Nov 21 '16 by Marcial Gonzalez

2 Answers

After spending an hour I got this working code. It may not be optimized.

Mymodel.py:

import os
import sys

# Path to the Spark installation
os.environ['SPARK_HOME'] = "E:\\Work\\spark\\installtion\\spark"

# Append pyspark to the Python path
sys.path.append("E:\\Work\\spark\\installtion\\spark\\python")

try:
    from numpy import array
    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.clustering import KMeans, KMeansModel
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Cannot import Spark modules: %s" % e)
    sys.exit(1)


if __name__ == "__main__":
    sconf = SparkConf().setAppName("KMeansExample").set('spark.sql.warehouse.dir', 'file:///E:/Work/spark/installtion/spark/spark-warehouse/')
    sc = SparkContext(conf=sconf)
    # Four 2-D points forming two well-separated clusters
    parsedData = array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0]).reshape(4, 2)
    clusters = KMeans.train(sc.parallelize(parsedData), 2, maxIterations=10,
                            initializationMode="random")
    clusters.save(sc, "mymodel")  # this saves the model to the file system
    sc.stop()

This code trains a k-means clustering model and saves it to the file system.
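
As a quick sanity check, you can reload the saved model in the same script before sc.stop() (a minimal sketch reusing the sc and "mymodel" path from above):

# Reload the saved model and score one point
sameModel = KMeansModel.load(sc, "mymodel")
print(sameModel.predict(array([9.0, 8.0])))  # prints the index of the nearest cluster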

API.py:

from flask import jsonify, Flask
import os
import sys

# Path to the Spark installation
os.environ['SPARK_HOME'] = "E:\\Work\\spark\\installtion\\spark"

# Append pyspark to the Python path
sys.path.append("E:\\Work\\spark\\installtion\\spark\\python")

try:
    from numpy import array
    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.clustering import KMeansModel
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Cannot import Spark modules: %s" % e)
    sys.exit(1)


app = Flask(__name__)

# Create the SparkContext and load the model once at startup: only one
# SparkContext can exist per process, so creating it inside the route
# would fail on the second request
sconf = SparkConf().setAppName("KMeansExample").set('spark.sql.warehouse.dir', 'file:///E:/Work/spark/installtion/spark/spark-warehouse/')
sc = SparkContext(conf=sconf)
sameModel = KMeansModel.load(sc, "mymodel")  # load from the file system

@app.route('/', methods=['GET'])
def predict():
    response = sameModel.predict(array([0.0, 0.0]))  # pass your data here
    return jsonify(cluster=int(response))

if __name__ == '__main__':
    app.run()

Above is my REST API written in Flask.

Make a call to http://127.0.0.1:5000/ and you will see the response in the browser.
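
You can also call it from Python (a minimal sketch, assuming the requests package is installed and the app is running on Flask's default port):

import requests

# Call the Flask endpoint and print the JSON prediction
resp = requests.get("http://127.0.0.1:5000/")
print(resp.json())  # e.g. {"cluster": 0}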

answered Sep 22 '22 by backtrack


Take a look at MLeap (a project I contribute to): it provides serialization/de-serialization of entire ML pipelines (not just the estimator) and an execution engine that doesn't rely on the Spark context, distributed data frames, or execution plans.

As of today, MLeap's runtime for executing models doesn't have Python bindings, only Scala/Java, but it shouldn't be complicated to add them. Feel free to reach out on GitHub to me and the other MLeap developers if you need help creating a scoring engine from your Spark-trained pipelines and models.
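
For context, exporting a Spark-trained pipeline to an MLeap bundle looks roughly like this from pyspark (a hedged sketch based on MLeap's pyspark support; the mleap package, the serializeToBundle call, and the paths shown are assumptions to check against MLeap's docs):

import mleap.pyspark  # assumption: monkey-patches serializeToBundle onto Spark models
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("mleap-export").getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["category"])

pipeline = Pipeline(stages=[StringIndexer(inputCol="category", outputCol="categoryIndex")])
model = pipeline.fit(df)

# Writes a self-contained bundle that MLeap's scala/java runtime can score without Spark
model.serializeToBundle("jar:file:/tmp/pipeline.zip", model.transform(df))  # illustrative path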

answered Sep 23 '22 by msemeniuk