
Xgboost dealing with imbalanced classification data

Tags:

r

xgboost

I have a dataset of about 20,000 training examples on which I want to do binary classification. The problem is that the dataset is heavily imbalanced, with only around 1,000 examples in the positive class. I am trying to use xgboost (in R) for my prediction.

I have tried oversampling and undersampling, but no matter what I do, the predictions always end up classifying everything as the majority class.

I tried reading this article on how to tune parameters in xgboost. https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

However, it only mentions which parameters help with imbalanced datasets, not how to tune them.

I would appreciate any advice on tuning the learning parameters of xgboost to handle imbalanced datasets, and also on how to generate the validation set for such cases.

asked Dec 05 '16 by Vikash Balasubramanian

3 Answers

According to the XGBoost documentation, the scale_pos_weight parameter is the one that deals with imbalanced classes. From the parameters documentation:

scale_pos_weight [default=1]: Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases). See Parameters Tuning for more discussion. Also see the Higgs Kaggle competition demo for examples: R, py1, py2, py3.
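
In R, assuming a 0/1 label vector named y_train (the name is illustrative, not from the question), that suggested value can be computed directly and passed as scale_pos_weight:

# negative/positive ratio suggested by the docs; with ~19,000 negatives
# and ~1,000 positives this comes out to about 19
spw <- sum(y_train == 0) / sum(y_train == 1)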

answered by tagoma


Try something like this in R

# xgbTrain here is the training data (e.g. an xgb.DMatrix that includes the labels)
bstSparse <- xgboost(data = xgbTrain, max_depth = 4, eta = 0.2, nthread = 2, nrounds = 200,
                     colsample_bytree = 0.7, gamma = 2.5,
                     scale_pos_weight = 48,   # weights positives vs. negatives for the imbalance
                     eval_metric = "auc", eval_metric = "logloss",
                     objective = "binary:logistic")

Here scale_pos_weight accounts for the imbalance; my baseline incidence rate is ~4%. Use hyperparameter optimization, and you can include scale_pos_weight in that search as well.
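
As a rough sketch of what such a search could look like (the candidate values and the 5-fold setup are illustrative, not from the original answer; xgbTrain is again assumed to be an xgb.DMatrix with labels):

library(xgboost)

# Try a few candidate values for scale_pos_weight and compare cross-validated AUC
for (spw in c(1, 10, 24, 48)) {
  cv <- xgb.cv(params = list(objective = "binary:logistic",
                             eval_metric = "auc",
                             max_depth = 4, eta = 0.2,
                             scale_pos_weight = spw),
               data = xgbTrain, nrounds = 200, nfolds = 5,
               early_stopping_rounds = 20, verbose = 0)
  best_auc <- max(cv$evaluation_log$test_auc_mean)
  cat("scale_pos_weight =", spw, " best CV AUC =", best_auc, "\n")
}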

answered by Ashish Markanday


A technique useful with neural networks is to introduce some noise into the observations. In R there is the 'jitter' function to do this. For your 1,000 rare cases, apply only a small amount of jitter to their features to give you another 1,000 cases. Run your code again and see if the predictions are now picking up any of the positive class. You can experiment with more added cases and/or with varying the amount of jitter. HTH, cousin_pete
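
As a minimal sketch of that idea, assuming a data.frame named train with numeric feature columns and a 0/1 column named label (both names are illustrative, not from the answer):

pos <- train[train$label == 1, ]                 # the ~1,000 positive cases
feature_cols <- setdiff(names(pos), "label")

# Copy the positives and add a small amount of noise to each numeric feature
pos_jittered <- pos
pos_jittered[feature_cols] <- lapply(pos[feature_cols], function(x) jitter(x, factor = 0.5))

train_augmented <- rbind(train, pos_jittered)    # roughly doubles the positive class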

answered by cousin_pete