I have been trying to train an ML classifier using Python and the scikit-learn toolkit.
First I applied my own threshold (e.g. int(len(X) * 0.75)) to split the dataset, and got this result when printing my metrics:
                 precision    recall  f1-score   support

              1       0.63      0.96      0.76        23
              2       0.96      0.64      0.77        36

    avg / total       0.83      0.76      0.76        59
Then I used cross-validation in order to get a more detailed view of the model's accuracy, using scores = cross_validation.cross_val_score(clf, X, y, cv=10), and got the scores below:
    Cross_val_scores = [0.66666667  0.79166667  0.45833333  0.70833333  0.52173913
                        0.52173913  0.47826087  0.47826087  0.52173913  0.47826087]
Accuracy: 0.56 (Standard Deviation: +/- 0.22), where Accuracy here equals mean(scores).
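For reference, that call would look roughly like this with the current model_selection API (clf stands for my estimator; older scikit-learn versions named the module cross_validation):

    from sklearn.model_selection import cross_val_score   # sklearn.cross_validation in older versions

    scores = cross_val_score(clf, X, y, cv=10)   # clf, X, y as in the sketch above
    print(scores)
    # The scikit-learn docs report the mean +/- two standard deviations,
    # which matches the 0.56 (+/- 0.22) above (std is about 0.11 here)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))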
Can someone please advise me on how to correctly interpret those scores? I understand how the dataset gets split when using cross-validation in order to observe the model's accuracy across the whole dataset, but I would like to know more.
Thank you for your time.
Cross-validation is a statistical method used to estimate the performance (or accuracy) of machine learning models. It is used to protect against overfitting in a predictive model, particularly in a case where the amount of data may be limited.
For example, setting k = 2 results in 2-fold cross-validation. In 2-fold cross-validation, we randomly shuffle the dataset into two sets d0 and d1, so that both sets are of equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and validate on d1, followed by training on d1 and validating on d0.
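As a sketch of that 2-fold procedure with scikit-learn's KFold (the estimator and data here are assumptions, reused from the question's setup):

    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    # Shuffle, then split the data into two equal halves d0 and d1;
    # each half is used once for training and once for validation.
    clf = SVC()                                           # assumed estimator
    kf = KFold(n_splits=2, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(X):                # X, y as in the question
        clf.fit(X[train_idx], y[train_idx])
        print(clf.score(X[val_idx], y[val_idx]))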
Is there a way to split it and achieve the highest accuracy possible (e.g. 0.79166667), and if so, how could I do that?
Probably, but that only means that the model you get by fitting the training part of that ideal split has great accuracy on the validation part of that ideal split. That is called overfitting, i.e. you get a model that is optimized only for specific data but won't generalize well to new data.
I imagine that happens because there is a split within my dataset such that a model trained on that data can produce a closer prediction.
Yes, a closer prediction on the validation part of that particular split.
Is there a way to reduce the relatively high standard deviation?
Yes, by choosing a model with less variance (e.g. a linear model with few parameters). But be aware that in this case you might lose prediction accuracy; this is the so-called bias-variance trade-off.
In general you just want to look for a model with a good mean cross-validation score (mCVS). But if your models all have the same mCVS, then you would go for the one with the least standard deviation. In finance, for example, where volatility and uncertainty are unwanted, models are chosen according to the Sharpe ratio, which would be something like mean/std. But in a Kaggle competition, where the winning criterion is the mCVS, you would obviously want to maximize that and ignore the std.
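A rough sketch of that selection logic, comparing a low-variance linear model against a more flexible one (both model choices here are arbitrary, purely for illustration):

    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # X, y as in the question
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),  # few parameters, lower variance
        "random_forest": RandomForestClassifier(random_state=0),   # more flexible, often higher variance
    }

    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=10)
        mean, std = scores.mean(), scores.std()
        # Choose by mean CV score; use std (or a Sharpe-like mean/std ratio) as a tie-breaker
        print("%s: mean=%.3f  std=%.3f  mean/std=%.2f" % (name, mean, std, mean / std))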
If you are worried that the variation in your dataset is not allowing you to meaningfully compare your models, then you could consider using a different number of splits and shuffling the data before splitting.
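For example (again only a sketch, and the fold counts are arbitrary):

    from sklearn.model_selection import KFold, cross_val_score

    # Shuffle the data before the splits are made and try a few fold counts;
    # clf, X, y as in the sketches above.
    for n_splits in (5, 10, 20):
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        scores = cross_val_score(clf, X, y, cv=cv)
        print("%2d folds: mean=%.3f  std=%.3f" % (n_splits, scores.mean(), scores.std()))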