 

Feature selection and cross validation

I want to train a regression model, and to do so I use random forests. However, I also need to do feature selection, because my dataset has many features and I'm afraid that using all of them would lead to overfitting. To assess the model's performance I also perform 5-fold cross validation, and my question is: which of the following two approaches is right, and why?

1- Split the data into two halves, do feature selection on the first half, and use the selected features to do 5-fold cross validation (CV) on the remaining half (in this case all 5 CV folds use exactly the same selected features).

2- Do the following procedure:

1- Split the data into 4/5 for training and 1/5 for testing.
2- Split this training data (the 4/5 of the full data) into two halves:
   a) On the first half, train the model and use the trained model to do feature selection.
   b) Use the features selected in (a) to train the model on the second half of the training data (this will be our final trained model).
3- Test the performance of the model on the remaining 1/5 of the data (which is never used in the training phase).
4- Repeat the previous steps 5 times, each time randomly (without replacement) splitting the data into 4/5 for training and 1/5 for testing.

My only concern with the second procedure is that it produces 5 models, and the feature set of the final model will be the union of the top features of these five models. So I'm not sure the 5-fold CV performance is reflective of the final model's performance, especially since the final model has different features than each model in the 5-fold CV (because it uses the union of the features selected in each fold).

DOSMarter asked Oct 29 '13 10:10


1 Answer

Cross validation should always be the outermost loop in any machine learning pipeline.

So, split the data into 5 sets. For each set you choose as your test set (1/5), do feature selection on the training set (the other 4/5) and then fit the model on those selected features. Repeat this for all the CV folds - here you have 5 folds.

Once the CV procedure is complete, you have an estimate of your model's accuracy: the simple average of the individual folds' accuracies.
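The per-fold procedure can be sketched with scikit-learn. Everything here beyond the structure of the loop is an illustrative assumption: the synthetic dataset, the `TOP_K` cutoff, and the use of random forest `feature_importances_` as the selection criterion.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Synthetic stand-in for the real dataset (an assumption)
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       random_state=0)
TOP_K = 10  # number of features kept per fold (illustrative choice)

fold_scores = []
fold_features = []  # selected feature indices, one array per fold
for train_idx, test_idx in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    # Feature selection on the training portion only - the test fold
    # never influences which features are chosen
    selector = RandomForestRegressor(n_estimators=50, random_state=0)
    selector.fit(X_tr, y_tr)
    top = np.argsort(selector.feature_importances_)[-TOP_K:]
    fold_features.append(top)

    # Refit on the selected features, score on the held-out fold
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr[:, top], y_tr)
    fold_scores.append(model.score(X_te[:, top], y_te))

cv_estimate = np.mean(fold_scores)  # the CV performance estimate
```

The key point the sketch enforces is that `selector.fit` only ever sees `X_tr`, so the held-out fold plays no part in feature selection.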

As for the final set of features for training the model on the complete dataset, do the following to select them.

-- Each time you do feature selection in a fold as outlined above, record a vote for every feature selected in that fold. At the end of the 5-fold CV, keep the particular number of features with the most votes.

Use that selected set of features to train the model on the complete data (all 5 folds combined) and move the model to production.
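The voting step might look like the sketch below. The per-fold selections and the `FINAL_K` cutoff are made-up illustrative values, and the synthetic dataset stands in for the real one:

```python
from collections import Counter
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Feature indices selected in each of the 5 CV folds (illustrative values)
fold_features = [[0, 1, 2, 3], [0, 1, 2, 5], [0, 1, 4, 5],
                 [0, 2, 3, 5], [0, 1, 2, 3]]
FINAL_K = 4  # how many top-voted features to keep (an assumption)

# One vote per appearance of a feature in a fold's selection
votes = Counter(idx for fold in fold_features for idx in fold)
final_features = [idx for idx, _ in votes.most_common(FINAL_K)]

# Train the final model on the complete data, restricted to those features
X, y = make_regression(n_samples=200, n_features=10, random_state=0)
final_model = RandomForestRegressor(n_estimators=50, random_state=0)
final_model.fit(X[:, final_features], y)
```

Note this differs from the asker's "union of all selected features": voting with a fixed cutoff keeps only features that were selected consistently across folds, rather than everything any fold happened to pick.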

London guy answered Oct 18 '22 06:10