How to Cross-Validate Properly

I have been trying to train an ML classifier using Python and the scikit-learn toolkit.

First I applied my own threshold (e.g. int(len(X)*0.75)) to split the dataset, and got this result when printing my metrics:

         precision    recall  f1-score   support

      1       0.63      0.96      0.76        23
      2       0.96      0.64      0.77        36

avg / total   0.83      0.76      0.76        59
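
For reference, here is a minimal sketch of that kind of manual split. The data, the sample count, and the SVC classifier are stand-ins (the question does not say which model was used); classification_report prints a table like the one above:

    from sklearn.datasets import make_classification
    from sklearn.metrics import classification_report
    from sklearn.svm import SVC

    # Stand-in data and classifier -- substitute your own X, y and model.
    X, y = make_classification(n_samples=236, random_state=0)

    # Manual 75/25 split at a fixed index, as described in the question.
    cut = int(len(X) * 0.75)
    X_train, X_test = X[:cut], X[cut:]
    y_train, y_test = y[:cut], y[cut:]

    clf = SVC().fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))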

Then I used cross-validation in order to get a more detailed view of the model's accuracy, using scores = cross_validation.cross_val_score(clf, X, y, cv=10) (cross_val_score requires the estimator, here the classifier clf, as its first argument), and got the scores below:

Cross_val_scores= [ 0.66666667 0.79166667 0.45833333 0.70833333 0.52173913
0.52173913 0.47826087 0.47826087 0.52173913 0.47826087]

Accuracy: 0.56 (+/- 0.22), where Accuracy here equals mean(scores) and the +/- interval is two standard deviations of the scores.
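
For completeness, a self-contained sketch of this step (stand-in data and classifier again; note that cross_val_score takes the estimator as its first argument, and that the old sklearn.cross_validation module has since been renamed sklearn.model_selection):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=236, random_state=0)  # stand-in data
    clf = SVC()  # stand-in classifier

    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    print("Cross_val_scores =", scores)
    # Mean plus a two-standard-deviation interval, matching the report above.
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))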

Can someone please advise me on how to interpret these scores correctly? I understand how the dataset gets split when using cross-validation in order to observe the model's accuracy across the whole range of the dataset, but I would like to know more.

  • For instance, is there a way to split it and achieve the highest accuracy possible (e.g. 0.79166667), and if so, how could I do that?
  • I imagine that happens because there is a split of my dataset such that a model trained on that data produces a closer prediction, right?
  • Is there a way to reduce the relatively high standard deviation?

Thank you for your time.




1 Answer

is there a way to split it and achieve the highest accuracy possible (e.g. 0.79166667), and if so, how could I do that?

Probably, but that would only mean that the model you get by fitting the training part of that ideal split has great accuracy on the validation part of that same split. This is called overfitting, i.e. you end up with a model that is optimized for one specific dataset but won't generalize well to new data.
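
To see this concretely, here is an illustrative sketch (stand-in data and model, not from the original answer): scanning random splits for the highest validation score just picks the luckiest validation set rather than a genuinely better model:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=236, random_state=0)  # stand-in data

    # Each random_state yields a different split; the best of these scores
    # reflects a lucky validation set, not better generalization.
    for seed in range(5):
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, train_size=0.75, random_state=seed)
        clf = SVC().fit(X_tr, y_tr)
        print("seed=%d: validation accuracy=%.3f" % (seed, clf.score(X_val, y_val)))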

I imagine that happens because there is a split of my dataset such that a model trained on that data produces a closer prediction

Yes, a closer prediction on the validation part of that particular split.

Is there a way to reduce the relatively high standard deviation?

Yes, by choosing a model with less variance (e.g. a linear model with few parameters). But be aware that you might then lose prediction accuracy; this is the so-called bias-variance trade-off.
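
As an illustrative sketch of that trade-off (the two models and the data are stand-ins), you could compare the fold-score spread of a simple linear model against a more flexible one:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=236, random_state=0)  # stand-in data

    for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                        ("decision tree", DecisionTreeClassifier())]:
        scores = cross_val_score(model, X, y, cv=10)
        # A lower-variance model often shows a smaller spread across folds,
        # possibly at the cost of a lower mean (the bias-variance trade-off).
        print("%s: mean=%.3f, std=%.3f" % (name, scores.mean(), scores.std()))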

In general you just want to look for a model with a good mean cross-validation score (mCVS). But if your models all have the same mCVS, then you would go for the one with the smallest standard deviation. In finance, for example, where volatility and uncertainty are unwanted, models are chosen according to the Sharpe ratio, which would be something like mean/std. But in a Kaggle competition where the winning criterion is the mCVS, you would obviously want to maximize that and ignore the std.
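
A small sketch of that selection rule, using hypothetical (made-up) per-model fold scores:

    import numpy as np

    # Hypothetical 10-fold CV scores for two candidate models.
    cv_scores = {
        "model_a": np.array([0.66, 0.79, 0.46, 0.71, 0.52,
                             0.52, 0.48, 0.48, 0.52, 0.48]),
        "model_b": np.array([0.58, 0.60, 0.55, 0.57, 0.59,
                             0.56, 0.58, 0.57, 0.56, 0.59]),
    }

    for name, scores in cv_scores.items():
        # mean/std rewards a high average score with low fold-to-fold volatility.
        print("%s: mean=%.3f, std=%.3f, mean/std=%.1f"
              % (name, scores.mean(), scores.std(), scores.mean() / scores.std()))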

If you are worried that the variation in your dataset is not allowing you to meaningfully compare your models, then you could consider using a different number of splits and shuffling the data before splitting.
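
For instance (again with stand-in data and classifier), scikit-learn's KFold lets you shuffle the data before forming the folds and vary the number of splits:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=236, random_state=0)  # stand-in data
    clf = SVC()  # stand-in classifier

    # Shuffle once (fixed seed for reproducibility) before forming the folds,
    # and try different numbers of splits to see how stable the estimate is.
    for n_splits in (5, 10):
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=42)
        scores = cross_val_score(clf, X, y, cv=cv)
        print("n_splits=%d: mean=%.3f, std=%.3f"
              % (n_splits, scores.mean(), scores.std()))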
