I am currently using H2O on a classification dataset. I am testing it with H2ORandomForestEstimator in a Python 3.6 environment. I noticed that the predict method was returning values between 0 and 1 (I assume these are probabilities).
In my dataset, the target attribute is numeric, i.e. True values are 1 and False values are 0. I made sure to convert the target attribute's type to category, but I was still getting the same result.
Then I modified the code to convert the target column to a factor using the asfactor() method on the H2OFrame; still, there was no change in the result.
But when I changed the values in the target attribute to True and False for 1 and 0 respectively, I got the expected result, i.e. the output was the classification rather than the probability.
In scikit-learn, the predict method returns the predicted class, while the predict_proba method returns the class probabilities (i.e. the probability that a particular data point falls into each of the classes).
In principle & in theory, hard & soft classification (i.e. returning classes & probabilities respectively) are different approaches, each one with its own merits & downsides. Consider for example the following, from the paper Hard or Soft Classification? Large-margin Unified Machines:
Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers while some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on estimated probabilities. In contrast, hard classifiers directly target on the classification decision boundary without producing the probability estimation. These two types of classifiers are based on different philosophies and each has its own merits.
That said, in practice, most of the classifiers used today, including Random Forest (the only exception I can think of is the SVM family), are in fact soft classifiers: what they actually produce underneath is a probability-like measure, which subsequently, combined with an implicit threshold (usually 0.5 by default in the binary case), gives a hard class membership like 0/1 or True/False.
What is the right way to get the classified prediction result?
For starters, it is always possible to go from probabilities to hard classes, but the opposite is not true.
Generally speaking, and given that your classifier is in fact a soft one, getting just the final hard classifications (True/False) gives a "black box" flavor to the process, which in principle is undesirable; handling the produced probabilities directly, and (important!) controlling the decision threshold explicitly, should be the preferable way here. In my experience, these are subtleties that are often lost on new practitioners; consider for example the following, from the Cross Validated thread Reduce Classification probability threshold:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
Apart from "soft" arguments (pun unintended) like the above, there are cases where you need to handle directly the underlying probabilities and thresholds, i.e. cases where the default threshold of 0.5 in binary classification will lead you astray, most notably when your classes are imbalanced; see my answer in High AUC but bad predictions with imbalanced data (and the links therein) for a concrete example of such a case.
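To make the point about thresholds concrete, here is a minimal sketch in plain Python. The `classify` helper, the example scores, and the choice of `>=` at the boundary are all assumptions made for illustration; they do not come from H2O or scikit-learn.

```python
def classify(probabilities, threshold=0.5):
    """Turn positive-class probabilities into 0/1 labels using an explicit threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical positive-class probabilities produced by some soft classifier.
scores = [0.05, 0.20, 0.35, 0.45, 0.80]

# Same probabilities, two different decisions.
default_labels = classify(scores)                 # threshold of 0.5
lowered_labels = classify(scores, threshold=0.3)  # more willing to flag the rare class

print(default_labels)
print(lowered_labels)
```

The probabilities are identical in both calls; only the decision rule changes, which is why the threshold is better treated as a separate, explicit choice.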
To be honest, I am rather surprised by the behavior of H2O you report (I haven't used it personally), i.e. that the kind of output is affected by the representation of the input; this should not be the case, and if it indeed is, we may have an issue of bad design. Compare for example the Random Forest classifier in scikit-learn, which includes two different methods, predict and predict_proba, to get the hard classifications and the underlying probabilities respectively (and checking the docs, it is apparent that the output of predict is based on the probability estimates, which have already been computed).
If probabilities are the outcomes for numerical target values, then how do I handle it in case of a multiclass classification?
There is nothing new here in principle, apart from the fact that a simple threshold is no longer meaningful; again, from the Random Forest predict docs in scikit-learn:
the predicted class is the one with highest mean probability estimate
That is, for 3 classes (0, 1, 2), you get an estimate of [p0, p1, p2] (with elements summing up to one, as per the rules of probability), and the predicted class is the one with the highest probability, e.g. class #1 for the case of [0.12, 0.60, 0.28]. Here is a reproducible example with the 3-class iris dataset (it's for the GBM algorithm and in R, but the rationale is the same).
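The "highest probability wins" rule for the three-class case can be sketched in a few lines of plain Python. The `predicted_class` helper is a hypothetical name used only for this illustration, and resolving ties in favor of the lowest index is an assumption of the sketch.

```python
def predicted_class(probabilities):
    """Return the index of the largest probability (ties go to the lowest index)."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


# The example estimate from the text, for classes 0, 1 and 2.
probs = [0.12, 0.60, 0.28]

label = predicted_class(probs)
print(label)  # 1
```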