 

Why is cross_val_predict not appropriate for measuring the generalisation error?

When I train an SVC with cross-validation,

y_pred = cross_val_predict(svc, X, y, cv=5, method='predict')

cross_val_predict returns one class prediction for each element in X, so that y_pred.shape = (1000,) when m = 1000. This makes sense, since cv=5 and therefore the SVC was trained and validated 5 times on different parts of X. In each of the five validations, predictions were made for one fifth of the instances (m/5 = 200). Subsequently the 5 vectors, each containing 200 predictions, were merged into y_pred.

With all of this in mind, it seems reasonable to me to calculate the overall accuracy of the SVC using y_pred and y:

score = accuracy_score(y, y_pred)
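
For reference, a minimal runnable sketch of this setup (with make_classification standing in for the real X and y, and an arbitrary SVC configuration) might look like this:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

# Toy data standing in for the original X and y (m = 1000 samples).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
svc = SVC()

# One out-of-fold class prediction per sample: y_pred.shape == (1000,).
y_pred = cross_val_predict(svc, X, y, cv=5, method='predict')

# Pooled accuracy over all out-of-fold predictions.
print(y_pred.shape, accuracy_score(y, y_pred))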

But (!) the documentation of cross_val_predict states:

The result of cross_val_predict may be different from those obtained using cross_val_score as the elements are grouped in different ways. The function cross_val_score takes an average over cross-validation folds, whereas cross_val_predict simply returns the labels (or probabilities) from several distinct models undistinguished. Thus, cross_val_predict is not an appropriate measure of generalisation error.

Could someone please explain in other words why cross_val_predict is not appropriate for measuring the generalisation error, e.g. via accuracy_score(y, y_pred)?


Edit:

I first assumed that with cv=5, in each of the 5 validations predictions would be made for all instances of X. But this is wrong; predictions are only made for 1/5 of the instances of X per validation.
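
To see this, one can pass an explicit splitter and check that every index of X appears in exactly one test fold. A small sketch, continuing from the snippet above (KFold is used here purely for illustration):

import numpy as np
from sklearn.model_selection import KFold

# Every index of X lands in exactly one test fold, so cross_val_predict
# produces exactly one out-of-fold prediction per instance.
cv = KFold(n_splits=5)
test_indices = np.concatenate([test for _, test in cv.split(X)])
assert len(test_indices) == len(X) == len(np.unique(test_indices))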

asked Mar 05 '19 by zwithouta

People also ask

What is cross_val_predict used for?

cross_val_predict generates cross-validated predictions for each input sample: the data are split according to the cv strategy, and each sample's prediction comes from a model that was trained on folds that did not contain that sample. It is typically used to inspect out-of-fold predictions (or probabilities), for example to build a confusion matrix or plot predictions against the true values.

What is the difference between cross_val_score and cross_validate?

The cross_validate function differs from cross_val_score in two ways: It allows specifying multiple metrics for evaluation. It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.
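
For instance, a short sketch with two metrics (the dataset and estimator here are placeholders):

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, random_state=0)

# The returned dict has fit times, score times and one test entry per metric,
# e.g. 'test_accuracy' and 'test_f1'.
results = cross_validate(SVC(), X, y, cv=5, scoring=['accuracy', 'f1'])
print(sorted(results.keys()))  # ['fit_time', 'score_time', 'test_accuracy', 'test_f1']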

Does cross_val_score train the model?

A common question is whether cross_val_score can also function as a way of training the final model. Unfortunately this is not the case: cross_val_score is a way of assessing a model and its parameters, and it cannot be used for final training.

What is cross_val_score?

cross_val_score returns a set of n scores, one from each fold of your n-fold cross-validation (by default n = 5). If you are working with a classification problem, StratifiedKFold is used by default, which ensures that the folds preserve the percentage of samples for each class.
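
A small sketch tying the last two answers together; the dataset is a placeholder, and cross_val_score is followed by an explicit fit because it never trains the final model itself:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
svc = SVC()

# Five scores, one per fold (StratifiedKFold is used by default for classifiers).
scores = cross_val_score(svc, X, y, cv=5)
print(scores, scores.mean())

# cross_val_score only evaluates; train the final model explicitly afterwards.
svc.fit(X, y)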


1 Answer

cross_val_score vs cross_val_predict

The differences between cross_val_predict and cross_val_score are described really clearly here, and there is another link in there, so you can follow the rabbit hole.

In essence:

  • cross_val_score returns a score for each fold
  • cross_val_predict makes out-of-fold predictions for each data point.

Now, you have no way of knowing which predictions in cross_val_predict came from which fold, hence you cannot calculate a per-fold average the way cross_val_score does. You could compare the mean of cross_val_score with the accuracy_score of cross_val_predict, but a mean of per-fold accuracies is not, in general, the same as the pooled accuracy over all predictions, hence the results would differ.

If one fold has a very low accuracy, it would pull down the mean of the fold scores more than it would affect the pooled accuracy computed from cross_val_predict's predictions.
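
A quick way to see the two quantities side by side, continuing with the svc, X and y from the question (exact numbers depend on the data; this is just a sketch):

from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score

# Mean of the per-fold accuracies ...
fold_scores = cross_val_score(svc, X, y, cv=5, scoring='accuracy')

# ... versus pooled accuracy over all out-of-fold predictions.
pooled = accuracy_score(y, cross_val_predict(svc, X, y, cv=5))

# With equally sized folds these tend to be close, but they are different
# quantities and can differ, especially when fold sizes are unequal.
print(fold_scores.mean(), pooled)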

Furthermore, you could group the seven data points of the example below into folds differently and get different results. That is why the documentation mentions that the elements being grouped in different ways makes the difference.

Example of difference between cross_val_score and cross_val_predict

Let's imagine cross_val_predict uses 3 folds for 7 data points, and the out-of-fold predictions are [0,1,1,0,1,0,1], while the true targets are [0,1,1,0,1,1,0]. The accuracy score would be calculated as 5/7 (only the last two were predicted incorrectly).

Now take those same predictions and split them into the following 3 folds:

  • [0, 1, 1] - prediction and [0, 1, 1] target -> accuracy of 1 for the first fold
  • [0, 1] - prediction and [0, 1] target -> perfect accuracy again
  • [0, 1] - prediction and [1, 0] target -> 0 accuracy

This is what cross_val_score does: it would return an array of per-fold accuracies, namely [1, 1, 0]. Now you can average this array, and the total accuracy is 2/3.

See? With the same data, you would get two different measures of accuracy (one being 5/7 and the other 2/3).
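
The arithmetic of this toy example can be reproduced directly; the slices below simply mirror the fold split described above:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1])

# Pooled accuracy over all out-of-fold predictions, i.e. what
# accuracy_score(y, cross_val_predict(...)) computes: 5/7.
pooled = accuracy_score(y_true, y_pred)

# Per-fold accuracies for the folds [0:3], [3:5], [5:7], i.e. what
# cross_val_score reports: [1.0, 1.0, 0.0], whose mean is 2/3.
folds = [slice(0, 3), slice(3, 5), slice(5, 7)]
fold_scores = [accuracy_score(y_true[f], y_pred[f]) for f in folds]

print(pooled)                             # 0.714... = 5/7
print(fold_scores, np.mean(fold_scores))  # [1.0, 1.0, 0.0] 0.666...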

In both cases, the grouping changed the total accuracy you would obtain. Classifier errors weigh more heavily with cross_val_score, as each error influences its fold's accuracy more than it would influence the pooled accuracy over all predictions (you can check this on your own).

Both could be used for evaluating your model's performance on the validation folds though, and I see no contraindication, just different behavior (errors within a single fold not weighing as heavily on the cross_val_predict measure).

Why neither is a measure of generalization

If you tune your algorithm according to the cross-validation results, you are performing data leakage (fine-tuning it to both the training and validation data). In order to get a sense of the generalization error, you would have to leave a part of your data out of both cross-validation and training.

You may want to perform nested (double) cross-validation, or simply hold out a test set, to find out how well your model actually generalizes.
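
A minimal sketch of the held-out test set variant, with a toy dataset and an illustrative parameter grid (none of this is from the original answer): model selection happens via cross-validation on the training portion only, and the test set is touched exactly once at the end:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

# Keep a test set completely outside of cross-validation and tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Tune with cross-validation on the training data only.
search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# A single final evaluation on data the model has never seen gives an
# honest estimate of the generalization error.
print(search.score(X_test, y_test))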

answered Sep 29 '22 by Szymon Maszke