What do columns ‘rawPrediction’ and ‘probability’ of DataFrame mean in Spark MLlib?

After training a LogisticRegressionModel, I transformed the test data DF with it and got the prediction DF. When I call prediction.show(), the output column names are: [label | features | rawPrediction | probability | prediction]. I know what label and features mean, but how should I understand rawPrediction, probability, and prediction?

asked Jun 19 '16 02:06

Hereme

1 Answer

Note: please also see the answer below by desertnaut https://stackoverflow.com/a/52947815/1056563

rawPrediction is typically the model's direct confidence score for each class, before any normalization into probabilities. From the Spark docs:

Raw prediction for each possible label. The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger = more confident).

The prediction is the index of the largest entry of rawPrediction, i.e. the label the model is most confident in, found via argmax:

  protected def raw2prediction(rawPrediction: Vector): Double =
    rawPrediction.argmax
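
To make the argmax step concrete, here is a minimal plain-Scala sketch that does not depend on Spark. The `ArgmaxSketch` object, the `argmax` helper, and the score values are hypothetical, chosen only for illustration; Spark's own `Vector.argmax` is what the snippet above actually calls.

```scala
// Illustrative only: a standalone stand-in for Vector.argmax, not Spark's code.
object ArgmaxSketch {
  // Return the index of the largest raw score (first index wins on ties).
  def argmax(raw: Array[Double]): Int =
    raw.indices.maxBy(i => raw(i))

  def main(args: Array[String]): Unit = {
    // Made-up raw scores for a two-class problem.
    val rawPrediction = Array(-1.2, 1.2)
    // Index 1 holds the largest score, so the predicted label is 1.0.
    println(argmax(rawPrediction).toDouble)
  }
}
```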

The Probability is the conditional probability for each class. Here is the scaladoc:

Estimate the probability of each class given the raw prediction,
doing the computation in-place. These predictions are also called class conditional probabilities.

The actual calculation depends on which Classifier you are using.

DecisionTree

Normalize a vector of raw predictions to be a multinomial probability vector, in place.

It simply sums by class across the instances and then divides by the total instance count.

 class_k probability = Count_k/Count_Total
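
The count-based normalization can be sketched in a few lines of plain Scala. This is an assumption-laden illustration, not Spark's implementation: `NormalizeCountsSketch`, `normalize`, and the sample counts are hypothetical names and values.

```scala
// Illustrative only: turn per-class counts into a probability vector.
object NormalizeCountsSketch {
  // Divide each class count by the total so the entries sum to 1.
  // If the total is zero, return the counts unchanged to avoid dividing by zero.
  def normalize(counts: Array[Double]): Array[Double] = {
    val total = counts.sum
    if (total == 0.0) counts.clone() else counts.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // Made-up counts: 3 training instances of class 0 and 1 of class 1.
    val classCounts = Array(3.0, 1.0)
    println(normalize(classCounts).mkString("[", ", ", "]")) // [0.75, 0.25]
  }
}
```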

LogisticRegression

For binary classification it applies the logistic (sigmoid) function to each raw score:

 class_k probability = 1/(1 + exp(-rawPrediction_k))
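
As a worked illustration of the logistic formula, the sketch below applies it to a pair of raw scores. It assumes the binary case, where (to my understanding) the raw prediction has the form [-margin, margin]; `LogisticSketch`, `sigmoid`, and the margin value are hypothetical and not taken from Spark's source.

```scala
// Illustrative only: logistic formula applied to made-up binary raw scores.
object LogisticSketch {
  // Standard logistic (sigmoid) function.
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // Apply the logistic formula to every raw score.
  def probabilities(raw: Array[Double]): Array[Double] = raw.map(sigmoid)

  def main(args: Array[String]): Unit = {
    val margin = 2.0                        // made-up model margin
    val rawPrediction = Array(-margin, margin)
    val probability = probabilities(rawPrediction)
    // Because sigmoid(-x) + sigmoid(x) = 1, the two entries sum to 1.
    println(probability.mkString("[", ", ", "]"))
  }
}
```

With a margin of 0 both classes get probability 0.5; the larger the margin, the closer the second entry moves to 1.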

Naive Bayes

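The raw predictions are log-scale scores. They are shifted by their maximum for numerical stability, exponentiated, and then normalized to sum to 1:

 class_k probability = exp(rawPrediction_k - max(rawPrediction)) / sum_j exp(rawPrediction_j - max(rawPrediction))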
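
A common, numerically stable way to turn log-scale scores into probabilities is to subtract the maximum, exponentiate, and divide by the sum. The sketch below shows that technique in plain Scala, under the assumption that the raw predictions are unnormalized log-probabilities; `LogScoresSketch`, `toProbabilities`, and the sample scores are hypothetical.

```scala
// Illustrative only: stable conversion of log-scale scores to probabilities.
object LogScoresSketch {
  def toProbabilities(raw: Array[Double]): Array[Double] = {
    val maxLog = raw.max
    // Shifting by the max keeps exp() from underflowing on very negative scores.
    val shifted = raw.map(r => math.exp(r - maxLog))
    val total = shifted.sum
    shifted.map(_ / total)
  }

  def main(args: Array[String]): Unit = {
    // Made-up log-scale scores for three classes.
    val rawPrediction = Array(-120.0, -118.0, -125.0)
    val probability = toProbabilities(rawPrediction)
    // The class with the largest raw score (index 1) gets the largest probability.
    println(probability.mkString("[", ", ", "]"))
  }
}
```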

Random Forest

 class_k probability = Count_k/Count_Total

answered Sep 25 '22 17:09

WestCoastProjects

Tags:

apache-spark-sql

logistic-regression

apache-spark-ml
