I have been trying to train an ML classifier using Python and the scikit-learn toolkit.
First I applied my own threshold (e.g. int(len(X) * 0.75)) to split the dataset, and got this result when printing my metrics:
                 precision    recall  f1-score   support

              1       0.63      0.96      0.76        23
              2       0.96      0.64      0.77        36

    avg / total       0.83      0.76      0.76        59
Then I used cross-validation in order to get a more detailed view of the model's accuracy, using scores = cross_validation.cross_val_score(clf, X, y, cv=10), and got the scores below:
    Cross_val_scores = [0.66666667  0.79166667  0.45833333  0.70833333  0.52173913
                        0.52173913  0.47826087  0.47826087  0.52173913  0.47826087]
Accuracy: 0.56 (Standard Deviation: +/- 0.22), where Accuracy here equals mean(scores).
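For reference, that call would look roughly like this with the current model_selection API (clf stands for my estimator; older scikit-learn versions named the module cross_validation):

    from sklearn.model_selection import cross_val_score   # sklearn.cross_validation in older versions

    scores = cross_val_score(clf, X, y, cv=10)   # clf, X, y as in the sketch above
    print(scores)
    # The scikit-learn docs report the mean +/- two standard deviations,
    # which matches the 0.56 (+/- 0.22) above (std is about 0.11 here)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))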
Can someone please advise me on how to correctly interpret those scores? I understand how the dataset gets split when using cross-validation in order to observe the model's accuracy across the whole dataset, but I would like to know more.
Thank you for your time.
Cross-validation is a statistical method used to estimate the performance (or accuracy) of machine learning models. It is used to protect against overfitting in a predictive model, particularly in a case where the amount of data may be limited.
For example, setting k = 2 results in 2-fold cross-validation. In 2-fold cross-validation, we randomly shuffle the dataset into two sets d0 and d1, so that both sets are of equal size (this is usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and validate on d1, followed by training on d1 and validating on d0.
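As a sketch of that 2-fold procedure with scikit-learn's KFold (the estimator and data here are assumptions, reused from the question's setup):

    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    # Shuffle, then split the data into two equal halves d0 and d1;
    # each half is used once for training and once for validation.
    clf = SVC()                                           # assumed estimator
    kf = KFold(n_splits=2, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(X):                # X, y as in the question
        clf.fit(X[train_idx], y[train_idx])
        print(clf.score(X[val_idx], y[val_idx]))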
Is there a way to split it and achieve the highest accuracy possible (e.g. 0.79166667), and if so, how could I do that?
Probably, but that only means that the model you get by fitting the training part of that ideal split has great accuracy on the validation part of that ideal split. That is called overfitting, i.e. you get a model that is optimized only for specific data but won't generalize well to new data.
I imagine that happens because there is a split within my dataset such that a model trained on that data can produce a closer prediction.
Yes, a closer prediction on the validation part of that particular split.
Is there a way to reduce the relatively high standard deviation?
Yes, by choosing a model with less variance (e.g. a linear model with few parameters). But be aware that in this case you might lose prediction accuracy; this is the so-called bias-variance trade-off.
In general you just want to look for a model with a good mean cross-validation score (mCVS). But if your models all have the same mCVS, then you would go for the one with the least standard deviation. In finance, for example, where volatility and uncertainty are unwanted, models are chosen according to the Sharpe ratio, which would be something like mean/std. But in a Kaggle competition, where the winning criterion is the mCVS, you would obviously want to maximize that and ignore the std.
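A rough sketch of that selection logic, comparing a low-variance linear model against a more flexible one (both model choices here are arbitrary, purely for illustration):

    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # X, y as in the question
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),  # few parameters, lower variance
        "random_forest": RandomForestClassifier(random_state=0),   # more flexible, often higher variance
    }

    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=10)
        mean, std = scores.mean(), scores.std()
        # Choose by mean CV score; use std (or a Sharpe-like mean/std ratio) as a tie-breaker
        print("%s: mean=%.3f  std=%.3f  mean/std=%.2f" % (name, mean, std, mean / std))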
If you are worried that the variation in your dataset is not allowing you to meaningfully compare your models, then you could consider using a different number of splits and shuffling the data before splitting.
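For example (again only a sketch, and the fold counts are arbitrary):

    from sklearn.model_selection import KFold, cross_val_score

    # Shuffle the data before the splits are made and try a few fold counts;
    # clf, X, y as in the sketches above.
    for n_splits in (5, 10, 20):
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        scores = cross_val_score(clf, X, y, cv=cv)
        print("%2d folds: mean=%.3f  std=%.3f" % (n_splits, scores.mean(), scores.std()))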