I am unable to save a random forest model generated using the ML package of Python/Spark.
>>> rf = RandomForestClassifier(labelCol="label", featuresCol="features")
>>> pipeline = Pipeline(stages=early_stages + [rf])
>>> model = pipeline.fit(trainingData)
>>> model.save("fittedpipeline")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PipelineModel' object has no attribute 'save'
>>> rfModel = model.stages[8]
>>> print(rfModel)
RandomForestClassificationModel (uid=rfc_46c07f6d7ac8) with 20 trees
>>> rfModel.save("rfmodel")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'RandomForestClassificationModel' object has no attribute 'save'
I also tried passing 'sc' as the first parameter to the save method.
The main issue with your code is that you are using a version of Apache Spark prior to 2.0.0. Thus, save isn't available yet for the Pipeline API.
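You can check which version you are running with sc.version (or spark.version on 2.x). As a minimal sketch, a helper like the following (hypothetical, not part of Spark) shows the version check that matters here, assuming the version string comes from sc.version:

```python
# Hypothetical helper: pipeline persistence (save/load) was added in
# Spark 2.0.0, so anything below major version 2 lacks it.
def supports_pipeline_save(version_string):
    major = int(version_string.split(".")[0])
    return major >= 2

print(supports_pipeline_save("1.6.2"))  # False
print(supports_pipeline_save("2.4.8"))  # True
```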
Here is a full example adapted from the official documentation. Let's create our pipeline first:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
labels = label_indexer.fit(data).labels
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
feature_indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4)
early_stages = [label_indexer, feature_indexer]
# Split the data into training and test sets (30% held out for testing)
(train, test) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)
# Convert indexed labels back to original labels.
label_converter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=early_stages + [rf, label_converter])
# Train model. This also runs the indexers.
model = pipeline.fit(train)
You can now save your pipeline:
>>> model.save("/tmp/rf")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
You can also save the RF model:
>>> rf_model = model.stages[2]
>>> print(rf_model)
RandomForestClassificationModel (uid=rfc_b368678f4122) with 10 trees
>>> rf_model.save("/tmp/rf_2")
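Note the index: the fitted stages appear in the same order as the stages passed to the Pipeline, so with early_stages + [rf, label_converter] the forest sits at index 2 (the model.stages[8] from the question only works if that pipeline has at least nine stages). A plain-Python sketch of the indexing, with illustrative stage names:

```python
# The fitted pipeline keeps its stages in construction order, so the
# position of each fitted stage matches where it was placed in the list.
stages = ["label_indexer", "feature_indexer", "random_forest", "label_converter"]
print(stages.index("random_forest"))  # 2
```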