How to get classification probabilities from PySpark MultilayerPerceptronClassifier?

I'm using Spark 2.0.1 from Python. My dataset is in a DataFrame, so I'm using the ML (not MLlib) library for machine learning. I have a multilayer perceptron classifier with only two labels.

Is it possible to get not only the predicted labels, but also (or only) the probability of each label? That is, not just 0 or 1 for every input, but something like 0.95 for label 0 and 0.05 for label 1. If this is not possible with MLP but is possible with another classifier, I can change the classifier. I only chose MLP because I know it should be capable of returning probabilities, but I can't find how to get them in PySpark.

I found a similar question, "How to get classification probabilities from MultilayerPerceptronClassifier?", but it uses Java, and the solution suggested there doesn't work in Python.

Thanks.

asked Apr 26 '17 10:04

Ondrej

1 Answer

Indeed, as of version 2.0, MLP in Spark ML does not seem to provide classification probabilities. Several other classifiers do, however, e.g. Logistic Regression, Naive Bayes, Decision Tree, and Random Forest. Here is a short example with the first and the last of these:

from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

# sc and sqlContext are the globals predefined by the pyspark shell
df = sqlContext.createDataFrame([
     (0.0, Vectors.dense(0.0, 1.0)),
     (1.0, Vectors.dense(1.0, 0.0))],
     ["label", "features"])
df.show()
# +-----+---------+
# |label| features|
# +-----+---------+
# |  0.0|[0.0,1.0]|
# |  1.0|[1.0,0.0]|
# +-----+---------+

lr = LogisticRegression(maxIter=5, regParam=0.01, labelCol="label")
lr_model = lr.fit(df)

rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="label", seed=42)
rf_model = rf.fit(df)

# test data:
test = sc.parallelize([Row(features=Vectors.dense(0.2, 0.5)),
                       Row(features=Vectors.dense(0.5, 0.2))]).toDF()

lr_result = lr_model.transform(test)
lr_result.show()
# +---------+--------------------+--------------------+----------+
# | features|       rawPrediction|         probability|prediction|
# +---------+--------------------+--------------------+----------+
# |[0.2,0.5]|[0.98941878916476...|[0.72897310704261...|       0.0|
# |[0.5,0.2]|[-0.9894187891647...|[0.27102689295738...|       1.0|  
# +---------+--------------------+--------------------+----------+

rf_result = rf_model.transform(test)
rf_result.show()
# +---------+-------------+--------------------+----------+ 
# | features|rawPrediction|         probability|prediction| 
# +---------+-------------+--------------------+----------+ 
# |[0.2,0.5]|    [1.0,2.0]|[0.33333333333333...|       1.0| 
# |[0.5,0.2]|    [1.0,2.0]|[0.33333333333333...|       1.0| 
# +---------+-------------+--------------------+----------+
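
To make the relationship between the rawPrediction and probability columns above more concrete, here is a minimal plain-Python sketch (no Spark required). It assumes that, for binary logistic regression, the raw prediction is [-margin, margin] and the probability is obtained with the logistic (sigmoid) function, and that for a random forest the probability is the raw per-class score normalised to sum to 1. The printed numbers above are consistent with both assumptions. The helper names lr_probability and rf_probability are hypothetical and not part of the Spark API.

```python
import math


def lr_probability(raw_prediction):
    """Binary logistic regression (assumed): raw prediction is [-margin, margin];
    the sigmoid of the margin gives the probability of class 1."""
    margin = raw_prediction[1]
    p1 = 1.0 / (1.0 + math.exp(-margin))
    return [1.0 - p1, p1]


def rf_probability(raw_prediction):
    """Random forest (assumed): normalise the per-class raw scores to sum to 1."""
    total = sum(raw_prediction)
    return [score / total for score in raw_prediction]


# Raw predictions copied from the first row of each table above
# (the logistic regression value is truncated in the printed output).
lr_probs = lr_probability([0.98941878916476, -0.98941878916476])
rf_probs = rf_probability([1.0, 2.0])

print(lr_probs)  # compare with the probability column of lr_result
print(rf_probs)  # compare with the probability column of rf_result
```

In Spark itself there is no need to recompute anything: the probability column already holds one entry per class, indexed by label. Calling show(truncate=False) on the result DataFrame prints the full vectors instead of the shortened ones shown above.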

For MLlib, see my answer here; for several undocumented & counter-intuitive features of PySpark classification, see my relevant blog post.

answered Nov 11 '22 09:11

desertnaut

Tags: machine-learning, neural-network, apache-spark, pyspark, apache-spark-ml
