
Parameter oob_score_ in scikit-learn equals accuracy or error?

I implemented Random Forest classifiers (RF) using the Python scikit-learn package for an ML problem. In the first stage I used cross-validation (CV) to spot-check several algorithms, and RF is now my choice.

Later on I also checked what the OOB estimate of the RF tells me. However, when I compare the value returned in 'oob_score_' with my results from CV, I see a large discrepancy.

The scikit-learn doc tells me:

oob_score : bool

Whether to use out-of-bag samples to estimate the generalization error.

Because of the doc I was assuming that the attribute 'oob_score_' is the error estimate. But while looking for reasons for the discrepancy, it occurred to me that it might actually estimate the accuracy instead. This would be, at least a bit, closer to my CV results. I also checked the code and now believe it is the accuracy, but I wanted to be sure... (in this case I find the doc misleading, BTW).

Is oob_score_ in scikit-learn accuracy or error estimation?

asked Jul 15 '15 by no_use123


1 Answer

It is analogous to the .score method, which returns the accuracy of the model; it simply generalizes it to the OOB scenario. The documentation is indeed a bit misleading.

As you can see in the code at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/forest.py

for k in range(self.n_outputs_):
    if (predictions[k].sum(axis=1) == 0).any():
        warn("Some inputs do not have OOB scores. "
             "This probably means too few trees were used "
             "to compute any reliable oob estimates.")

    decision = (predictions[k] /
                predictions[k].sum(axis=1)[:, np.newaxis])
    oob_decision_function.append(decision)
    oob_score += np.mean(y[:, k] ==
                         np.argmax(predictions[k], axis=1), axis=0)

It simply computes the mean fraction of correct classifications over the OOB samples, i.e. an accuracy, not an error.
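You can also verify this empirically: fit a forest with oob_score=True and recompute the OOB accuracy by hand from oob_decision_function_. A minimal sketch (the dataset and parameters here are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy dataset; enough trees so every sample gets OOB predictions.
X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

# Recompute the OOB accuracy manually from the OOB decision function:
# for each sample, take the class with the highest OOB vote fraction
# and compare it to the true label.
oob_pred = rf.classes_[np.argmax(rf.oob_decision_function_, axis=1)]
manual_acc = np.mean(y == oob_pred)

print(rf.oob_score_, manual_acc)  # the two values agree
```

If oob_score_ were an error estimate, it would equal 1 - manual_acc instead; the fact that the two values match confirms it is an accuracy.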

answered Sep 21 '22 by lejlot