 

Pyspark ML - How to save pipeline and RandomForestClassificationModel

I am unable to save a random forest model generated using the ML package of Python/Spark.

>>> rf = RandomForestClassifier(labelCol="label", featuresCol="features")
>>> pipeline = Pipeline(stages=early_stages + [rf])
>>> model = pipeline.fit(trainingData)
>>> model.save("fittedpipeline")

Traceback (most recent call last):
  File "", line 1, in
AttributeError: 'PipelineModel' object has no attribute 'save'

>>> rfModel = model.stages[8]
>>> print(rfModel)

RandomForestClassificationModel (uid=rfc_46c07f6d7ac8) with 20 trees

>>> rfModel.save("rfmodel")

Traceback (most recent call last):
  File "", line 1, in
AttributeError: 'RandomForestClassificationModel' object has no attribute 'save'

I also tried passing 'sc' as the first parameter to the save method.

asked Jul 08 '17 by Nasir Mahmood



1 Answer

The main issue with your code is that you are using a version of Apache Spark prior to 2.0.0, where save is not yet available on the Pipeline API.
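You can confirm this by checking the running Spark version (sc.version in the shell). A minimal sketch of that check; the helper name is illustrative and the parsing assumes a "major.minor.patch" version string:

```python
# Pipeline/model persistence was added to the ML (DataFrame) API in Spark 2.0.0.
def ml_persistence_available(spark_version: str) -> bool:
    # Compare only the major version component.
    major = int(spark_version.split(".")[0])
    return major >= 2

# In a PySpark shell you would pass sc.version, e.g. ml_persistence_available(sc.version)
print(ml_persistence_available("1.6.3"))  # False
print(ml_persistence_available("2.2.0"))  # True
```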

Here is a full example adapted from the official documentation. Let's create the pipeline first:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
labels = label_indexer.fit(data).labels

# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
feature_indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4)

early_stages = [label_indexer, feature_indexer]

# Split the data into training and test sets (30% held out for testing)
(train, test) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)

# Convert indexed labels back to original labels.
label_converter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labels)

# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=early_stages + [rf, label_converter])

# Train model. This also runs the indexers.
model = pipeline.fit(train)

You can now save your pipeline:

>>> model.save("/tmp/rf")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

You can also save the RF model:

>>> rf_model = model.stages[2]
>>> print(rf_model)
RandomForestClassificationModel (uid=rfc_b368678f4122) with 10 trees
>>> rf_model.save("/tmp/rf_2")
answered Apr 02 '23 by eliasah