I'm using the R wrapper for XGBoost. In the function xgb.cv, there is a folds parameter with this description:
list provides a possibility of using a list of pre-defined CV folds (each element must be a vector of fold's indices). If folds are supplied, the nfold and stratified parameters would be ignored.
So, do I just specify the indices for training the model and assume the rest will be for testing? For example, if my training data is something like
   Feature1 Feature2 Target
1:        2       10     10
2:        7        1      9
3:        8        2      3
4:        8       10      7
5:        8        2      9
6:        3        7      3
and I want to cross validate using (train, test) indices as ((1,2,3), (4,5,6)) and ((4,5,6), (1,2,3)) do I set folds=list(c(1,2,3), c(4,5,6))?
Through some trial and error I figured out that xgboost uses the passed indices as the indices of the test folds. I confirmed this by checking that the current development version of xgboost states it explicitly in the documentation.
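To make the semantics concrete (a base-R sketch using the toy data above, not part of the original answer): each element of folds is a vector of test-row indices, and the training rows for that fold are simply the complement.

```r
# Each element of folds holds the TEST indices for one CV round
folds <- list(c(1, 2, 3), c(4, 5, 6))
n <- 6  # number of rows in the toy data above

# xgboost trains on the remaining rows for each fold
train.idx <- lapply(folds, function(test) setdiff(seq_len(n), test))
train.idx[[1]]  # rows 4, 5, 6 are used for training when fold 1 is held out
```

So folds=list(c(1,2,3), c(4,5,6)) does give the (train, test) splits ((4,5,6), (1,2,3)) and ((1,2,3), (4,5,6)), i.e. the listed indices are the held-out test sets.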
Here is an example of both generating the folds and using them. Assume our data frame has a column of ids, and we want to put all rows with a given id value into the same fold. The code below iterates over the ids, building a list of the row indices that match each id:
fold.ids <- unique(df$id)
custom.folds <- vector("list", length(fold.ids))

i <- 1
for (id in fold.ids) {
  # All rows sharing this id go into the same (test) fold
  custom.folds[[i]] <- which(df$id == id)
  i <- i + 1
}
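The same fold list can also be built in one line with base R's split() (an equivalent alternative; the df below is made up for demonstration):

```r
# Hypothetical data frame: two rows per id, ids already in sorted order
df <- data.frame(id = c("a", "a", "b", "b", "c", "c"), x = 1:6)

# split() groups the row indices by id; unname() drops the id names
# so the result matches the plain list the loop above produces.
# Note: split() orders folds by sorted factor levels, which can differ
# from unique()'s order of first appearance if the ids are unsorted.
custom.folds <- unname(split(seq_len(nrow(df)), df$id))
custom.folds  # list(1:2, 3:4, 5:6)
```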
Here is an example of using the above fold list in xgb.cv (the arguments are named here for clarity; positionally they are params, data, and nrounds):

res <- xgb.cv(params = param, data = dtrain, nrounds = nround,
              folds = custom.folds, prediction = TRUE)
Reasonable values for the other xgb.cv parameters can be found in its documentation.
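For completeness, here is a hedged, self-contained sketch of the whole setup on the toy data from the question. The parameter values are purely illustrative (not from the original answer), and the call assumes the classic xgboost R API; the xgb.cv step only runs if the package is installed.

```r
# Illustrative parameters only -- tune these for your own data
param <- list(
  objective = "reg:squarederror",  # squared-error regression
  max_depth = 3,
  eta = 0.1
)
nround <- 20

if (requireNamespace("xgboost", quietly = TRUE)) {
  library(xgboost)

  # Toy data from the question: two features, one numeric target
  X <- matrix(c(2, 7, 8, 8, 8, 3,
                10, 1, 2, 10, 2, 7), ncol = 2,
              dimnames = list(NULL, c("Feature1", "Feature2")))
  y <- c(10, 9, 3, 7, 9, 3)
  dtrain <- xgb.DMatrix(data = X, label = y)

  # Each fold vector holds the TEST indices; nfold is ignored
  res <- xgb.cv(params = param, data = dtrain, nrounds = nround,
                folds = list(c(1, 2, 3), c(4, 5, 6)),
                prediction = TRUE, verbose = FALSE)
}
```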