
Save and load two ML models in pyspark

First I train two ML models and save them to two separate paths. Note that both models are trained on the same dataframe; features_1 and features_2 are different sets of features extracted from the same dataset.

import sys
from pyspark.ml.classification import RandomForestClassifier

trainer_1 = RandomForestClassifier(featuresCol="features_1")
trainer_2 = RandomForestClassifier(featuresCol="features_2")
model_1 = trainer_1.fit(df_training_data)
model_2 = trainer_2.fit(df_training_data)

model_1.save(sys.argv[1])
model_2.save(sys.argv[2])

Then, when I later want to use the models, I have to load them both from their respective paths, providing the paths, for example, via sys.argv.

import sys
from pyspark.ml.classification import RandomForestClassificationModel

model_1 = RandomForestClassificationModel.load(sys.argv[1])
model_2 = RandomForestClassificationModel.load(sys.argv[2])

What I want is an elegant way to save these two models together, as one, under the same path. I want this mainly so that the user does not have to keep track of two separate pathnames every time the models are saved and loaded. The two models are closely connected and will generally be created and used together, so they are effectively one model.

Is this the kind of thing pipelines are intended for?

1 Answer

I figured out a way to do it simply by placing both models inside one folder. The user then only needs to know and provide the path to that folder.

import sys
import os
from pyspark.ml.classification import RandomForestClassifier

trainer_1 = RandomForestClassifier(featuresCol="features_1")
trainer_2 = RandomForestClassifier(featuresCol="features_2")
model_1 = trainer_1.fit(df_training_data)
model_2 = trainer_2.fit(df_training_data)

path = sys.argv[1]
os.mkdir(path)  # create the folder that will hold both models
model_1.save(os.path.join(path, 'model_1'))
model_2.save(os.path.join(path, 'model_2'))

The subfolder names model_1 and model_2 are hardcoded, so the user does not need to know them.

import sys
import os
from pyspark.ml.classification import RandomForestClassificationModel

model_1 = RandomForestClassificationModel.load(os.path.join(sys.argv[1], 'model_1'))
model_2 = RandomForestClassificationModel.load(os.path.join(sys.argv[1], 'model_2'))

This should solve the problem. Is this the best way to do it, or is there an even better way to bundle the models together using functionality from the Spark library?
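For reference, one Spark-native way to bundle the models is a Pipeline: if the two classifiers are given distinct output columns, they can be fitted as stages of a single PipelineModel, which saves to and loads from one path. This is a minimal sketch, assuming the same df_training_data and feature columns as above; the output column names are arbitrary choices.

import sys
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import RandomForestClassifier

# Give each classifier its own output columns so the two stages do not clash.
trainer_1 = RandomForestClassifier(
    featuresCol="features_1", predictionCol="prediction_1",
    rawPredictionCol="rawPrediction_1", probabilityCol="probability_1")
trainer_2 = RandomForestClassifier(
    featuresCol="features_2", predictionCol="prediction_2",
    rawPredictionCol="rawPrediction_2", probabilityCol="probability_2")

# Fit both stages and save the whole pipeline under a single path.
pipeline_model = Pipeline(stages=[trainer_1, trainer_2]).fit(df_training_data)
pipeline_model.save(sys.argv[1])

# Later: load everything back from that one path.
pipeline_model = PipelineModel.load(sys.argv[1])
model_1, model_2 = pipeline_model.stages

With this layout, pipeline_model.transform(df) applies both models in one pass; whether that is preferable to the folder approach depends on how the predictions are consumed.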
