 

Improving model training speed in caret (R)

I have a dataset consisting of 20 features and roughly 300,000 observations. I'm using caret to train models with doParallel and four cores. Even training on 10% of my data takes well over eight hours for the methods I've tried (rf, nnet, adabag, svmPoly). I'm resampling with bootstrapping 3 times and my tuneLength is 5. Is there anything I can do to speed up this agonizingly slow process? Someone suggested that using the underlying library directly can speed up the process as much as 10x, but before I go down that route I'd like to make sure there is no other alternative.

Asked by Alexander David on Oct 02 '15

People also ask

What does train() do in R?

The train function can generate a candidate set of parameter values and the tuneLength argument controls how many are evaluated. In the case of PLS, the function uses a sequence of integers from 1 to tuneLength . If we want to evaluate all integers between 1 and 15, setting tuneLength = 15 would achieve this.
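A minimal sketch of how tuneLength drives the candidate grid, using the built-in mtcars data (the dataset and seed here are illustrative, not from the question; assumes the caret and pls packages are installed):

```r
library(caret)

# With method = "pls", tuneLength = 5 makes train() evaluate
# ncomp = 1, 2, 3, 4, 5 and keep the best-performing value.
set.seed(42)
fit <- train(mpg ~ ., data = mtcars,
             method = "pls",
             tuneLength = 5)

fit$results$ncomp  # the candidate values train() generated
fit$bestTune       # the winning ncomp
```

Setting tuneLength = 15 on a dataset with enough predictors would evaluate ncomp = 1 through 15 in the same way.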

What is tuneLength?

tuneLength lets the system tune the algorithm automatically: it sets the number of different values to try for each tuning parameter, for example mtry for randomForest. With tuneLength = 5, caret tries 5 different mtry values and finds the optimal mtry based on those 5.

What is tuneGrid R?

The tuneGrid parameter lets us decide exactly which values each tuning parameter will take, while tuneLength only controls how many default values are tried.
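A sketch of the difference, again with illustrative data and mtry values of my own choosing (assumes caret and randomForest are installed):

```r
library(caret)

# tuneGrid: we pick the exact mtry values to evaluate,
# instead of letting tuneLength generate defaults.
grid <- expand.grid(mtry = c(1, 2, 4))

set.seed(42)
fit <- train(Species ~ ., data = iris,
             method = "rf",
             tuneGrid = grid,
             trControl = trainControl(method = "boot", number = 3))

fit$results$mtry  # only the values we listed: 1, 2, 4
```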


1 Answer

What people forget when comparing the underlying model to using caret is that caret does a lot of extra work on top of it.

Take your random forest as an example: bootstrap resampling with number = 3, and tuneLength = 5. You resample 3 times, and because of the tuneLength caret tries 5 candidate values to find a good mtry. In total you fit 3 × 5 = 15 random forests and compare them to get the best one for the final model, versus only 1 if you use the basic random forest model directly.

Also, you are running in parallel on 4 cores, and random forest needs all the observations available, so all your training observations will be in memory 4 times. That probably leaves very little memory for training the model itself.

My advice is to start scaling down to see if you can speed things up: set the bootstrap number to 1 and tuneLength back to the default of 3, or even set the trainControl method to "none", just to get an idea of how fast the model fits with minimal settings and no resampling.
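The minimal-settings timing run above can be sketched like this (the data, mtry value, and seed are placeholders; assumes caret and randomForest are installed). Note that method = "none" requires a one-row tuneGrid, since no tuning can happen without resampling:

```r
library(caret)

# No resampling, no tuning: a single model fit, to time the bare cost.
ctrl <- trainControl(method = "none")

set.seed(42)
fit <- train(Species ~ ., data = iris,
             method = "rf",
             trControl = ctrl,
             tuneGrid = data.frame(mtry = 2))  # exactly one candidate

fit$finalModel  # the single fitted randomForest
```

Wrapping the train() call in system.time() gives a baseline to compare against the full bootstrap-plus-tuning run.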

Answered by phiver on Nov 15 '22