I am unable to save a random forest model generated using the ML package of Python/Spark.
>>> rf = RandomForestClassifier(labelCol="label", featuresCol="features")
>>> pipeline = Pipeline(stages=early_stages + [rf])
>>> model = pipeline.fit(trainingData)
>>> model.save("fittedpipeline")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PipelineModel' object has no attribute 'save'
>>> rfModel = model.stages[8]
>>> print(rfModel)
RandomForestClassificationModel (uid=rfc_46c07f6d7ac8) with 20 trees
>>> rfModel.save("rfmodel")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'RandomForestClassificationModel' object has no attribute 'save'
I also tried passing 'sc' as the first parameter to the save method.
The main issue with your code is that you are using a version of Apache Spark prior to 2.0.0. Thus, save isn't available yet for the Pipeline API.
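You can check which version you are running with sc.version (or spark.version on 2.x). As a minimal sketch, a helper like the following (hypothetical, not part of Spark) shows the version check that matters here, assuming the version string comes from sc.version:

```python
# Hypothetical helper: pipeline persistence (save/load) was added in
# Spark 2.0.0, so anything below major version 2 lacks it.
def supports_pipeline_save(version_string):
    major = int(version_string.split(".")[0])
    return major >= 2

print(supports_pipeline_save("1.6.2"))  # False
print(supports_pipeline_save("2.4.8"))  # True
```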
Here is a full example adapted from the official documentation. Let's create our pipeline first:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
labels = label_indexer.fit(data).labels
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
feature_indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4)
early_stages = [label_indexer, feature_indexer]
# Split the data into training and test sets (30% held out for testing)
(train, test) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", numTrees=10)
# Convert indexed labels back to original labels.
label_converter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labels)
# Chain indexers and forest in a Pipeline
pipeline = Pipeline(stages=early_stages + [rf, label_converter])
# Train model. This also runs the indexers.
model = pipeline.fit(train)
You can now save your pipeline:
>>> model.save("/tmp/rf")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
You can also save the RF model:
>>> rf_model = model.stages[2]
>>> print(rf_model)
RandomForestClassificationModel (uid=rfc_b368678f4122) with 10 trees
>>> rf_model.save("/tmp/rf_2")
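Note the index: the fitted stages appear in the same order as the stages passed to the Pipeline, so with early_stages + [rf, label_converter] the forest sits at index 2 (the model.stages[8] from the question only works if that pipeline has at least nine stages). A plain-Python sketch of the indexing, with illustrative stage names:

```python
# The fitted pipeline keeps its stages in construction order, so the
# position of each fitted stage matches where it was placed in the list.
stages = ["label_indexer", "feature_indexer", "random_forest", "label_converter"]
print(stages.index("random_forest"))  # 2
```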