I have a question about cross validation.
In machine learning, we know there are training, validation, and test sets, and the test set is the final run to see how the finished model/classifier performs.
But in cross validation we split the data into a training set and a testing set (most tutorials use this term), so I'm confused. Do we need to split the whole data into 3 parts: training, validation, test? In cross validation we only ever talk about the relationship between 2 sets: training and the other.
Could someone help clarify?
Thanks
1 Answer. Is K-fold cross validation used to select the final model (or algorithm)? If yes, as you said, then the final model should be tested on an extra set that has no overlap with the data used in K-fold CV (i.e. a test set).
Cross-validation is usually the preferred method because it gives your model the opportunity to train on multiple train-test splits. This gives you a better indication of how well your model will perform on unseen data.
The goal of cross-validation is to estimate the expected level of fit of a model to a data set that is independent of the data that were used to train the model. It can be used to estimate any quantitative measure of fit that is appropriate for the data and model.
Cross-validation in your case would build k estimators (assuming k-fold CV), and then you can check the predictive power and variance of the technique on your data as follows: the mean of the quality measure (higher is better) and the standard deviation of the quality measure (lower means a more stable estimate).
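Here is a minimal sketch of that idea, assuming scikit-learn; the dataset and the classifier are just illustrative choices:

```python
# Summarise k-fold CV results by the mean and standard deviation of a score.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: fits 5 estimators, each scored on its held-out fold
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("mean accuracy:", scores.mean())   # higher is better
print("std of accuracy:", scores.std())  # lower means a more stable estimate
```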
Yep, it's a little confusing, because some material uses CV/test interchangeably and some material doesn't, but I'll try to make it easy to understand by explaining why each set is needed:
You need the train set to do exactly that: train. But you also need a way to check that your algorithm isn't just memorizing the train set (that it's not overfitting) and to see how well it's doing. That's why you need a test set: data the model has never seen, on which you can measure performance.
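A minimal sketch of that split (scikit-learn assumed, dataset and classifier are illustrative):

```python
# Hold out a test set the model never sees during fitting, then score on it.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))  # can be near-perfect (memorization)
print("test accuracy:", clf.score(X_test, y_test))     # the number that actually matters
```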
But... ML is all about experimentation. You will train, evaluate, tweak some knobs (hyperparameters or architectures), train again, evaluate again, over and over, and then select the best experiment. You deploy your system, in production it gets data it has never seen, and it doesn't perform that well. What happened? You used your test data to make decisions and effectively fit to it, so you overfitted to the test data, and you no longer know how the model does on truly unseen data.
Cross validation solves this. You have your train data to learn parameters and your test data to evaluate how the model does on unseen data, but you still need a way to experiment with hyperparameters and architectures: you take a sample of your training data and call it the cross validation set, and you hide your test data; you will NEVER use it until the end.
Now use your train data to learn parameters and experiment with hyperparameters and architectures, but evaluate each experiment on the cross validation data instead of the test data (you can see this as using the CV data to learn the hyperparameters). After you have experimented a lot and selected your best performing option (on CV), you finally use your test data to evaluate how it performs on data it has never seen, before deploying it to production.
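A rough sketch of this whole workflow, assuming scikit-learn (GridSearchCV does the cross-validation on the training data for you; the dataset, model, and parameter grid are illustrative):

```python
# Hide a test set, tune hyperparameters with CV on the training data only,
# and touch the test set exactly once at the end.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 1. Hide the test set -- never used until the very end
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Experiment: hyperparameters are chosen by 5-fold CV on the training data only
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)
print("best params (chosen on the CV folds):", search.best_params_)

# 3. Final, one-off estimate of performance on truly unseen data
print("test accuracy:", search.score(X_test, y_test))
```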