Difference between cross_val_score and cross_val_predict

I want to evaluate a regression model built with scikit-learn using cross-validation, and I am confused about which of the two functions, cross_val_score or cross_val_predict, I should use. One option would be:

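    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    # Score each test fold separately, then report the mean and spread
    cvs = DecisionTreeRegressor(max_depth=depth)
    scores = cross_val_score(cvs, predictors, target, cv=cvfolds, scoring='r2')
    print("R2-Score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))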

Another option is to use the cross-validated predictions with the standard r2_score:

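    from sklearn.metrics import r2_score
    from sklearn.model_selection import cross_val_predict

    # Collect the out-of-fold predictions, then compute one R^2 over all of them
    cvp = DecisionTreeRegressor(max_depth=depth)
    predictions = cross_val_predict(cvp, predictors, target, cv=cvfolds)
    print("CV R^2-Score: {}".format(r2_score(target, predictions)))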

I would assume that both methods are valid and give similar results, but that is only the case for a small number of folds. While the R^2 is roughly the same for 10-fold CV, it gets increasingly lower for higher k values in the first version, which uses cross_val_score. The second version is mostly unaffected by the number of folds.

Is this behavior to be expected, or am I missing something about cross-validation in scikit-learn?

asked Apr 25 '17 14:04

Bobipuegi

1 Answer

cross_val_score returns the score of each test fold, whereas cross_val_predict returns the predicted y values for the test folds.
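To make the difference in return values concrete, here is a minimal sketch. The synthetic data from make_regression, the tree depth, and the choice of 5 folds are my own assumptions for illustration and are not taken from the question.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration (an assumption, not the asker's data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
model = DecisionTreeRegressor(max_depth=3, random_state=0)

# One R^2 value per test fold.
fold_scores = cross_val_score(model, X, y, cv=5, scoring="r2")

# One out-of-fold prediction per sample, in the original sample order.
oof_predictions = cross_val_predict(model, X, y, cv=5)

print(fold_scores.shape)
print(oof_predictions.shape)
```

So the first function hands you k numbers that you have to aggregate yourself, while the second hands you n predictions that you still have to score.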

With cross_val_score(), you are averaging the per-fold scores. That average is affected by the number of folds: with more folds, each test fold is smaller, so some folds may show a high error (the model does not fit them well), and those folds pull the mean down.

In contrast, cross_val_predict() returns, for each element in the input, the prediction that was obtained for that element when it was in the test set. [Note that only cross-validation strategies that assign every element to a test set exactly once can be used.] Increasing the number of folds therefore only increases the training data available for each test element, so its result may not be affected much.
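If you want to see the two aggregation styles side by side, something like the following sketch may help. It is not the asker's setup: the synthetic data, the fold counts, and the shuffled KFold splitter are assumptions I picked for illustration. The exact numbers will depend on the data, so I am not quoting any output here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration (an assumption).
X, y = make_regression(n_samples=120, n_features=5, noise=20.0, random_state=0)
model = DecisionTreeRegressor(max_depth=3, random_state=0)

results = {}
for k in (5, 10, 30, 60):
    # A fixed random_state gives both functions the same splits.
    cv = KFold(n_splits=k, shuffle=True, random_state=0)

    # Style 1: compute R^2 inside each test fold, then average the k values.
    mean_fold_r2 = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()

    # Style 2: pool all out-of-fold predictions and compute a single R^2.
    pooled_r2 = r2_score(y, cross_val_predict(model, X, y, cv=cv))

    results[k] = (mean_fold_r2, pooled_r2)
    print(f"k={k:3d}  mean per-fold R^2: {mean_fold_r2:8.3f}  pooled R^2: {pooled_r2:8.3f}")
```

With k=60 each test fold here contains only two samples, so the per-fold R^2 is computed against the variance of just two target values. That is the situation in which a single badly predicted fold can dominate the average.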

Edit (after comment)

Please have a look at the following answer on how cross_val_predict works:

How is scikit-learn cross_val_predict accuracy score calculated?

I think that cross_val_predict will tend to overfit because, as the number of folds increases, more data is used for training and less for testing, so the resulting prediction depends more heavily on the training data. Also, as mentioned above, the prediction for each sample is made only once, so the result may be more sensitive to how the data happens to be split. That's why most tutorials recommend using cross_val_score for analysis.
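One rough way to check how much the pooled score depends on the particular split is to repeat cross_val_predict with differently shuffled folds and look at the spread. This is only a sketch under my own assumptions (synthetic data, 10 folds, 10 shuffles); it does not prove the point above, it just gives you a number to look at on your own data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration (an assumption).
X, y = make_regression(n_samples=120, n_features=5, noise=20.0, random_state=0)
model = DecisionTreeRegressor(max_depth=3, random_state=0)

pooled_scores = []
for seed in range(10):
    # Each seed assigns the samples to the 10 folds differently.
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    predictions = cross_val_predict(model, X, y, cv=cv)
    pooled_scores.append(r2_score(y, predictions))

pooled_scores = np.array(pooled_scores)
print("pooled R^2 over 10 splits: mean %.3f, std %.3f"
      % (pooled_scores.mean(), pooled_scores.std()))
```

A large standard deviation relative to the mean would suggest that a single cross_val_predict run is not a stable estimate for that data set.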

answered Oct 16 '22 23:10

Vivek Kumar

Tags: python, machine-learning, scikit-learn, regression, cross-validation
