First I create two ML models and save them to two separate files. Note that both models are trained on the same dataframe: features_1 and features_2 are different sets of features extracted from the same dataset.
import sys
from pyspark.ml.classification import RandomForestClassifier

# Two classifiers trained on different feature columns of the same dataframe
trainer_1 = RandomForestClassifier(featuresCol="features_1")
trainer_2 = RandomForestClassifier(featuresCol="features_2")
model_1 = trainer_1.fit(df_training_data)
model_2 = trainer_2.fit(df_training_data)

# Each model is saved to its own path
model_1.save(sys.argv[1])
model_2.save(sys.argv[2])
Then, when I later want to use the models, I have to load them both from their respective paths, providing the paths, e.g., via sys.argv.
import sys
from pyspark.ml.classification import RandomForestClassificationModel
model_1 = RandomForestClassificationModel.load(sys.argv[1])
model_2 = RandomForestClassificationModel.load(sys.argv[2])
What I want is an elegant way to save these two models together, as one, under a single path. I want this mainly so that the user does not have to keep track of two separate pathnames every time they save and load. The two models are closely connected and will generally be created and used together, so they are effectively one model.
Is this the kind of thing pipelines are intended for?
I figured out a way to do it simply by placing them together in a folder. Then the user only needs to know and provide the path to this folder.
import sys
import os
from pyspark.ml.classification import RandomForestClassifier
trainer_1 = RandomForestClassifier(featuresCol="features_1")
trainer_2 = RandomForestClassifier(featuresCol="features_2")
model_1 = trainer_1.fit(df_training_data)
model_2 = trainer_2.fit(df_training_data)
path = sys.argv[1]
os.mkdir(path)
model_1.save(os.path.join(path, 'model_1'))
model_2.save(os.path.join(path, 'model_2'))
The names model_1 and model_2 are hardcoded and do not need to be known by the user.
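To keep the hardcoded subdirectory names in exactly one place, the save and load halves can also be wrapped in a pair of small helpers (the function names and the model_cls parameter are hypothetical, not Spark API):

```python
import os

def save_models(model_1, model_2, path):
    # Both models land under one user-visible path; the subnames
    # 'model_1' and 'model_2' are an internal convention.
    model_1.save(os.path.join(path, "model_1"))
    model_2.save(os.path.join(path, "model_2"))

def load_models(path, model_cls):
    # model_cls would be e.g. RandomForestClassificationModel;
    # passing it in keeps the helper free of a hard pyspark import.
    return (model_cls.load(os.path.join(path, "model_1")),
            model_cls.load(os.path.join(path, "model_2")))
```

That way the directory-layout convention lives in one module instead of being repeated at every save and load site.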
import sys
import os
from pyspark.ml.classification import RandomForestClassificationModel
model_1 = RandomForestClassificationModel.load(os.path.join(sys.argv[1], 'model_1'))
model_2 = RandomForestClassificationModel.load(os.path.join(sys.argv[1], 'model_2'))
This solves the problem. But is this the best way to do it, or is there a better way to bundle the models together using functionality from the Spark library itself?