I am trying to implement xgboost on a classification dataset with imbalanced classes (1% ones and 99% zeroes), using binary:logistic as the objective function.

As I understand xgboost, as boosting proceeds it builds trees iteratively, optimizing the objective function so that the combined ensemble performs best at the end.

Because of the class imbalance in my data, I am running into the Accuracy Paradox: the final model achieves great accuracy but poor precision and recall.

I would like a custom objective function that optimizes the model so that the final xgboost model has the best F-score. Or is there any other objective function I can use that yields the best F-score?

Where F-Score = (2 * Precision * Recall) / (Precision + Recall)
I'm no expert in the matter, but I think this evaluation metric should do the job:
f1score_eval <- function(preds, dtrain) {
  # Custom evaluation metric for xgboost: F1-score at a 0.5 threshold.
  # Assumes `preds` are predicted probabilities (binary:logistic objective).
  labels <- getinfo(dtrain, "label")

  e_TP <- sum((labels == 1) & (preds >= 0.5))  # true positives
  e_FP <- sum((labels == 0) & (preds >= 0.5))  # false positives
  e_FN <- sum((labels == 1) & (preds < 0.5))   # false negatives
  e_TN <- sum((labels == 0) & (preds < 0.5))   # true negatives

  e_precision <- e_TP / (e_TP + e_FP)
  e_recall    <- e_TP / (e_TP + e_FN)
  e_f1 <- 2 * (e_precision * e_recall) / (e_precision + e_recall)

  return(list(metric = "f1-score", value = e_f1))
}
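
If it helps, here is a minimal sketch of how the metric could be plugged into training. It assumes an existing xgb.DMatrix named train_matrix (a hypothetical name, not from the question); feval and maximize are the R package's arguments for supplying a custom evaluation metric, and the other parameter values are just placeholders:

library(xgboost)

params <- list(
  objective = "binary:logistic",
  eta = 0.1,
  max_depth = 6
)

# Track F1 on the training data and stop when it stops improving.
model <- xgb.train(
  params = params,
  data = train_matrix,
  nrounds = 200,
  watchlist = list(train = train_matrix),
  feval = f1score_eval,    # the custom metric defined above
  maximize = TRUE,         # F1 should be maximized, not minimized
  early_stopping_rounds = 20
)

Note that this only changes how rounds are evaluated and when training stops; the objective being optimized is still binary:logistic.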
References:
https://github.com/dmlc/xgboost/issues/1152
http://xgboost.readthedocs.io/en/latest/parameter.html