 

Why is cross_val_predict not appropriate for measuring the generalisation error?

When I train an SVC with cross-validation,

y_pred = cross_val_predict(svc, X, y, cv=5, method='predict')

cross_val_predict returns one class prediction for each element in X, so that y_pred.shape = (1000,) when m = 1000. This makes sense, since cv=5 and therefore the SVC was trained and validated 5 times on different parts of X. In each of the five validations, predictions were made for one fifth of the instances (m/5 = 200). Subsequently the 5 vectors, each containing 200 predictions, were merged into y_pred.

With all of this in mind, it seems reasonable to me to calculate the overall accuracy of the SVC using y_pred and y:

score = accuracy_score(y, y_pred)
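
For reference, a minimal runnable sketch of this setup (with make_classification standing in for the real X and y, and an arbitrary SVC configuration) might look like this:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

# Toy data standing in for the original X and y (m = 1000 samples).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
svc = SVC()

# One out-of-fold class prediction per sample: y_pred.shape == (1000,).
y_pred = cross_val_predict(svc, X, y, cv=5, method='predict')

# Pooled accuracy over all out-of-fold predictions.
print(y_pred.shape, accuracy_score(y, y_pred))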

But (!) the documentation of cross_val_predict states:

The result of cross_val_predict may be different from those obtained using cross_val_score as the elements are grouped in different ways. The function cross_val_score takes an average over cross-validation folds, whereas cross_val_predict simply returns the labels (or probabilities) from several distinct models undistinguished. Thus, cross_val_predict is not an appropriate measure of generalisation error.

Could someone please explain in other words why cross_val_predict is not appropriate for measuring the generalisation error, e.g. via accuracy_score(y, y_pred)?


Edit:

I first assumed that with cv=5, in each of the 5 validations predictions would be made for all instances of X. But this is wrong; predictions are only made for 1/5 of the instances of X per validation.
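
To see this, one can pass an explicit splitter and check that every index of X appears in exactly one test fold. A small sketch, continuing from the snippet above (KFold is used here purely for illustration):

import numpy as np
from sklearn.model_selection import KFold

# Every index of X lands in exactly one test fold, so cross_val_predict
# produces exactly one out-of-fold prediction per instance.
cv = KFold(n_splits=5)
test_indices = np.concatenate([test for _, test in cv.split(X)])
assert len(test_indices) == len(X) == len(np.unique(test_indices))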

asked Mar 05 '19 by zwithouta

People also ask

What is cross_val_predict used for?

cross_val_predict generates cross-validated predictions for each input sample: the data are split according to the cv strategy, and each sample's prediction comes from a model that was trained on folds that did not contain that sample. It is typically used to inspect out-of-fold predictions (or probabilities), for example to build a confusion matrix or plot predictions against the true values.

What is the difference between cross_val_score and cross_validate?

The cross_validate function differs from cross_val_score in two ways: It allows specifying multiple metrics for evaluation. It returns a dict containing fit-times, score-times (and optionally training scores as well as fitted estimators) in addition to the test score.
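
For instance, a short sketch with two metrics (the dataset and estimator here are placeholders):

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, random_state=0)

# The returned dict has fit times, score times and one test entry per metric,
# e.g. 'test_accuracy' and 'test_f1'.
results = cross_validate(SVC(), X, y, cv=5, scoring=['accuracy', 'f1'])
print(sorted(results.keys()))  # ['fit_time', 'score_time', 'test_accuracy', 'test_f1']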

Does cross_val_score train the model?

A common question is whether cross_val_score can also function as a way of training the final model. Unfortunately this is not the case: cross_val_score is a way of assessing a model and its parameters, and it cannot be used for final training.

What is cross_val_score?

cross_val_score returns a set of n scores, one from each fold of your n-fold cross-validation (by default n = 5). If you are working with a classification problem, StratifiedKFold is used by default, which ensures that the folds preserve the percentage of samples for each class.
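
A small sketch tying the last two answers together; the dataset is a placeholder, and cross_val_score is followed by an explicit fit because it never trains the final model itself:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
svc = SVC()

# Five scores, one per fold (StratifiedKFold is used by default for classifiers).
scores = cross_val_score(svc, X, y, cv=5)
print(scores, scores.mean())

# cross_val_score only evaluates; train the final model explicitly afterwards.
svc.fit(X, y)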


1 Answer

cross_val_score vs cross_val_predict

The differences between cross_val_predict and cross_val_score are described really clearly here, and there is another link in there, so you can follow the rabbit hole.

In essence:

  • cross_val_score returns a score for each fold
  • cross_val_predict makes out-of-fold predictions for each data point.

Now, you have no way of knowing which predictions in cross_val_predict came from which fold, hence you cannot calculate a per-fold average the way cross_val_score does. You could compare the mean of cross_val_score with the accuracy_score of cross_val_predict, but a mean of per-fold accuracies is not, in general, the same as the pooled accuracy over all predictions, hence the results would differ.

If one fold has a very low accuracy, it would pull down the mean of the fold scores more than it would affect the pooled accuracy computed from cross_val_predict's predictions.
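
A quick way to see the two quantities side by side, continuing with the svc, X and y from the question (exact numbers depend on the data; this is just a sketch):

from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.metrics import accuracy_score

# Mean of the per-fold accuracies ...
fold_scores = cross_val_score(svc, X, y, cv=5, scoring='accuracy')

# ... versus pooled accuracy over all out-of-fold predictions.
pooled = accuracy_score(y, cross_val_predict(svc, X, y, cv=5))

# With equally sized folds these tend to be close, but they are different
# quantities and can differ, especially when fold sizes are unequal.
print(fold_scores.mean(), pooled)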

Furthermore, you could group the seven data points of the example below into folds differently and get different results. That is why the documentation mentions that the elements being grouped in different ways makes the difference.

Example of difference between cross_val_score and cross_val_predict

Let's imagine cross_val_predict uses 3 folds for 7 data points, and the out-of-fold predictions are [0,1,1,0,1,0,1], while the true targets are [0,1,1,0,1,1,0]. The accuracy score would be calculated as 5/7 (only the last two were predicted incorrectly).

Now take those same predictions and split them into the following 3 folds:

  • [0, 1, 1] - prediction and [0, 1, 1] target -> accuracy of 1 for the first fold
  • [0, 1] - prediction and [0, 1] target -> perfect accuracy again
  • [0, 1] - prediction and [1, 0] target -> 0 accuracy

This is what cross_val_score does: it would return an array of per-fold accuracies, namely [1, 1, 0]. Now you can average this array, and the total accuracy is 2/3.

See? With the same data, you would get two different measures of accuracy (one being 5/7 and the other 2/3).
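
The arithmetic of this toy example can be reproduced directly; the slices below simply mirror the fold split described above:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1])

# Pooled accuracy over all out-of-fold predictions, i.e. what
# accuracy_score(y, cross_val_predict(...)) computes: 5/7.
pooled = accuracy_score(y_true, y_pred)

# Per-fold accuracies for the folds [0:3], [3:5], [5:7], i.e. what
# cross_val_score reports: [1.0, 1.0, 0.0], whose mean is 2/3.
folds = [slice(0, 3), slice(3, 5), slice(5, 7)]
fold_scores = [accuracy_score(y_true[f], y_pred[f]) for f in folds]

print(pooled)                             # 0.714... = 5/7
print(fold_scores, np.mean(fold_scores))  # [1.0, 1.0, 0.0] 0.666...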

In both cases, the grouping changed the total accuracy you would obtain. Classifier errors weigh more heavily with cross_val_score, as each error influences its fold's accuracy more than it would influence the pooled accuracy over all predictions (you can check this on your own).

Both could be used for evaluating your model's performance on the validation folds though, and I see no contraindication, just different behavior (errors within a single fold not weighing as heavily on the cross_val_predict measure).

Why neither is a measure of generalization

If you tune your algorithm according to the cross-validation results, you are performing data leakage (fine-tuning it to both the training and validation data). In order to get a sense of the generalization error, you would have to leave a part of your data out of both cross-validation and training.

You may want to perform nested (double) cross-validation, or simply hold out a test set, to find out how well your model actually generalizes.
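
A minimal sketch of the held-out test set variant, with a toy dataset and an illustrative parameter grid (none of this is from the original answer): model selection happens via cross-validation on the training portion only, and the test set is touched exactly once at the end:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

# Keep a test set completely outside of cross-validation and tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Tune with cross-validation on the training data only.
search = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# A single final evaluation on data the model has never seen gives an
# honest estimate of the generalization error.
print(search.score(X_test, y_test))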

answered Sep 29 '22 by Szymon Maszke