Predict classes or class probabilities?

I am currently using H2O for a classification problem. I am testing it out with H2ORandomForestEstimator in a Python 3.6 environment. I noticed that the results of the predict method were values between 0 and 1 (I am assuming this is the probability).

In my data set, the target attribute is numeric, i.e. True values are 1 and False values are 0. I made sure I converted the type to category for the target attribute, but I was still getting the same result.

Then I modified the code to convert the target column to a factor using the asfactor() method on the H2OFrame; still, there was no change in the result.
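Here is a minimal sketch of what I am doing (the file name train.csv and the column name target below are placeholders, not my actual data):

```python
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()

# Hypothetical training frame with a numeric 0/1 target column named "target"
train = h2o.import_file("train.csv")
train["target"] = train["target"].asfactor()   # mark the target as categorical

features = [c for c in train.columns if c != "target"]
model = H2ORandomForestEstimator(ntrees=50, seed=42)
model.train(x=features, y="target", training_frame=train)
```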

But when I changed the values in the target attribute to True and False (in place of 1 and 0 respectively), I got the expected result, i.e. the output was the classification rather than the probability.

  • What is the right way to get the classified prediction result?
  • If probabilities are the outcomes for numerical target values, then how do I handle it in the case of multiclass classification?
asked Jul 16 '18 by Rahul

1 Answer

In principle & in theory, hard & soft classification (i.e. returning classes & probabilities respectively) are different approaches, each one with its own merits & downsides. Consider for example the following, from the paper Hard or Soft Classification? Large-margin Unified Machines:

Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers while some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on estimated probabilities. In contrast, hard classifiers directly target on the classification decision boundary without producing the probability estimation. These two types of classifiers are based on different philosophies and each has its own merits.

That said, in practice, most of the classifiers used today, including Random Forest (the only exception I can think of is the SVM family) are in fact soft classifiers: what they actually produce underneath is a probability-like measure, which subsequently, combined with an implicit threshold (usually 0.5 by default in the binary case), gives a hard class membership like 0/1 or True/False.
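To make this concrete, here is a minimal sketch in scikit-learn (scikit-learn rather than H2O, purely for illustration) of a soft classifier whose hard predictions are just its probabilities passed through the default 0.5 threshold:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]      # soft output: estimated P(class = 1)
hard = (proba > 0.5).astype(int)             # implicit default threshold of 0.5
print(np.mean(hard == clf.predict(X_test)))  # predict() is effectively proba + threshold
```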

What is the right way to get the classified prediction result?

For starters, it is always possible to go from probabilities to hard classes, but the opposite is not true.

Generally speaking, and given the fact that your classifier is in fact a soft one, getting just the end hard classifications (True/False) gives a "black box" flavor to the process, which in principle should be undesirable; handling directly the produced probabilities, and (important!) controlling explicitly the decision threshold should be the preferable way here. According to my experience, these are subtleties that are often lost to new practitioners; consider for example the following, from the Cross Validated thread Reduce Classification probability threshold:

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.

Apart from "soft" arguments (pun unintended) like the above, there are cases where you need to handle directly the underlying probabilities and thresholds, i.e. cases where the default threshold of 0.5 in binary classification will lead you astray, most notably when your classes are imbalanced; see my answer in High AUC but bad predictions with imbalanced data (and the links therein) for a concrete example of such a case.

To be honest, I am rather surprised by the behavior of H2O you report (I haven't used it personally), i.e. that the kind of output is affected by the representation of the input; this should not be the case, and if it indeed is, we may have an issue of bad design. Compare for example the Random Forest classifier in scikit-learn, which includes two different methods, predict and predict_proba, to get the hard classifications and the underlying probabilities respectively (and checking the docs, it is apparent that the output of predict is based on the probability estimates, which have already been computed).
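If H2O behaves like other soft classifiers here (again, I cannot verify this personally), then with the target converted via asfactor() the prediction frame should already expose both pieces of information, without changing the label representation. A hedged sketch, continuing the hypothetical setup from the question:

```python
# Continues the hypothetical H2O sketch from the question; model, h2o, and the
# file/column names are assumptions, not confirmed against the asker's code.
test = h2o.import_file("test.csv")
preds = model.predict(test)

print(preds.head())                      # a 'predict' column plus per-class probability columns
hard_classes = preds["predict"]          # the hard classifications
probabilities = preds.drop("predict")    # the underlying class probabilities
```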

If probabilities are the outcomes for numerical target values, then how do I handle it in case of a multiclass classification?

There is nothing new here in principle, apart from the fact that a simple threshold is no longer meaningful; again, from the Random Forest predict docs in scikit-learn:

the predicted class is the one with highest mean probability estimate

That is, for 3 classes (0, 1, 2), you get an estimate of [p0, p1, p2] (with elements summing up to one, as per the rules of probability), and the predicted class is the one with the highest probability, e.g. class #1 for the case of [0.12, 0.60, 0.28]. Here is a reproducible example with the 3-class iris dataset (it's for the GBM algorithm and in R, but the rationale is the same).
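A minimal scikit-learn sketch of the same idea on the 3-class iris dataset (in Python with Random Forest, rather than the R/GBM example linked above):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:3])   # one [p0, p1, p2] row per sample; each row sums to 1
hard = proba.argmax(axis=1)        # predicted class = index of the highest probability
print(proba)
print(hard, clf.predict(X[:3]))    # argmax over the probabilities matches predict()
```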

answered Oct 04 '22 by desertnaut