 

Cross Validation--Use testing set or validation set to predict?

I have a question about cross validation.

In machine learning, we know there are training, validation, and test sets, and the test set is the final run to see how the final model/classifier performs.

But in the process of cross validation, we split the data into a training set and a testing set (most tutorials use this term), so I'm confused. Do we need to split the whole dataset into 3 parts: training, validation, and test? In cross validation we only ever seem to talk about the relationship between 2 sets: training and the other.

Could someone help clarify?

Thanks

asked Apr 27 '17 by ADJ



1 Answer

Yep, it's a little confusing, as some material uses CV/test interchangeably and some material doesn't, but I'll try to make it easy to understand by explaining why each set is needed:

You need the train set to do exactly that: train. But you also need a way to make sure your algorithm isn't just memorizing the train set (that it's not overfitting) and to see how well it's actually doing. That's what creates the need for a test set: you give the model data it has never seen and measure its performance.
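As a minimal sketch of that idea, assuming scikit-learn and its iris toy dataset (any classifier and dataset would do):

```python
# Minimal sketch of a train/test split, assuming scikit-learn;
# the dataset and classifier here are just placeholder choices.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data that the model will never train on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Accuracy measured on data the model has never seen.
print("test accuracy:", model.score(X_test, y_test))
```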

But... ML is all about experimentation. You will train, evaluate, tweak some knob (hyperparameters or architecture), train again, evaluate again, over and over, and then select the best experiment results. You deploy your system, in production it gets data it has never seen, and it doesn't perform that well. What happened? You used your test data to fit parameters and make decisions, so you overfitted to that test data, and you no longer know how the model does on data it has truly never seen.

Cross validation solves this. You have your train data to learn parameters and your test data to evaluate how the model does on unseen data, but you still need a way to experiment to find the best hyperparameters and architectures: you take a sample of your training data, call it the cross validation set, and hide your test data away; you will NEVER use it until the end.
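One way to carve out all three sets is two successive splits. This is a sketch assuming scikit-learn, with a 60/20/20 ratio picked only for illustration:

```python
# Sketch of a three-way split, assuming scikit-learn and the iris
# toy dataset; the 60/20/20 ratio is an arbitrary example.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the test set and hide it away until the very end.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split what remains into train and cross validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42  # 0.25 of 80% = 20%
)
```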

Now use your train data to learn parameters, and experiment with hyperparameters and architectures, but evaluate each experiment on the cross validation data instead of the test data (you can think of the CV data as a way to "learn" the hyperparameters). After you have experimented a lot and selected your best-performing option (on CV), you use your test data to evaluate how it performs on data it has never seen, before deploying it to production.
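Putting it all together, here is a sketch of that workflow using scikit-learn's GridSearchCV, which runs k-fold cross validation inside the training data; the SVC model and the grid of C values are just placeholder choices:

```python
# Sketch of the full workflow with k-fold CV, assuming scikit-learn:
# tune hyperparameters via cross validation on the training data,
# then touch the test set exactly once at the end.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 5-fold CV tries each hyperparameter setting on five different
# train/validation splits drawn from the training data only.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

print("best C (chosen on CV):", search.best_params_)
# Only now do we look at the hidden test set, exactly once.
print("final test accuracy:", search.score(X_test, y_test))
```

Note that with k-fold CV you don't even need a single fixed validation set: the folds inside the training data play that role, and the test set is still used exactly once at the end.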

answered Nov 15 '22 by Luis Leal