
xgboost binary logistic regression

I am having problems running logistic regression with xgboost that can be summarized in the following example.

Let's assume I have a very simple dataframe with two predictors and one target variable:

import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'X1': pd.Series([1,0,0,1]), 'X2': pd.Series([0,1,1,0]), 'Y': pd.Series([0,1,1,0])})

I can't post images because I'm new here, but we can clearly see that when X1 = 1 and X2 = 0, Y is 0, and when X1 = 0 and X2 = 1, Y is 1.

My idea is to build a model that outputs the probability that an observation belongs to each of the classes, so if I run xgboost trying to predict two new observations, (1,0) and (0,1), like so:

X = df[['X1','X2']].values
y = df['Y'].values
test = pd.DataFrame({'X1': [1, 0], 'X2': [0, 1]})  # the two new observations

params = {'objective': 'binary:logistic',
          'num_class': 2}

clf1 = xgb.train(params=params, dtrain=xgb.DMatrix(X, y), num_boost_round=100)
clf1.predict(xgb.DMatrix(test.values))

the output is:

array([[ 0.5,  0.5],
       [ 0.5,  0.5]], dtype=float32)

which, I imagine, means that for the first observation there is a 50% chance of it belonging to each of the classes.

I'd like to know why the algorithm won't output a proper (1,0), or something close to that, if the relationship between the variables is clear.

FYI, I did try with more data (I'm only using 4 rows for simplicity) and the behavior is almost the same; what I do notice is that not only do the probabilities not sum to 1, they are often very small, like so (this result is from a different dataset, nothing to do with the example above):

array([[ 0.00356463,  0.00277259],
       [ 0.00315137,  0.00268578],
       [ 0.00453343,  0.00157113],
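
As an aside on the output shape: the two columns that don't sum to 1 come from combining 'binary:logistic' with 'num_class': 2. A minimal sketch (not the original poster's code, and using the same df as above) of the two conventional setups: 'binary:logistic' alone returns a single column holding P(Y=1) per row, while 'multi:softprob' with num_class=2 returns one column per class, with each row summing to 1.

import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'X1': [1, 0, 0, 1], 'X2': [0, 1, 1, 0], 'Y': [0, 1, 1, 0]})
dtrain = xgb.DMatrix(df[['X1', 'X2']].values, df['Y'].values)
test = xgb.DMatrix(pd.DataFrame({'X1': [1, 0], 'X2': [0, 1]}).values)

# 'binary:logistic' alone: one column, P(Y = 1) for each row
p_binary = xgb.train({'objective': 'binary:logistic'}, dtrain, 10).predict(test)

# 'multi:softprob' with num_class=2: one column per class, rows sum to 1
p_soft = xgb.train({'objective': 'multi:softprob', 'num_class': 2},
                   dtrain, 10).predict(test)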
asked Feb 01 '16 by Italo



1 Answer

OK, here's what is happening.

The clue to why it isn't working is that on such a small dataset it cannot train properly. I trained this exact model, and if you observe the dump of all the trees, you will see that they never split.

(tree dump below)

NO SPLITS, they have been pruned!

[1] "booster[0]" "0:leaf=-0" "booster[1]" "0:leaf=-0" "booster[2]" "0:leaf=-0" [7] "booster[3]" "0:leaf=-0" "booster[4]" "0:leaf=-0" "booster[5]" "0:leaf=-0" [13] "booster[6]" "0:leaf=-0" "booster[7]" "0:leaf=-0" "booster[8]" "0:leaf=-0" [19] "booster[9]" "0:leaf=-0"

There isn't enough weight in each of the leaves to overpower xgboost's internal regularization (which penalizes the model for growing).

These parameters may or may not be accessible from the Python version, but you can grab them from R if you do a GitHub install:

http://xgboost.readthedocs.org/en/latest/parameter.html

lambda [default=1] L2 regularization term on weights

alpha [default=0] L1 regularization term on weights

Basically, this is why your example trains better as you add more data, but cannot train at all with only 4 examples and default settings.
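
A minimal sketch of what relaxing that regularization looks like from Python (an assumption on my part, not part of the original answer: current Python builds accept these directly in the params dict, and min_child_weight, which sets the minimum hessian weight a leaf must hold, matters here too, since at p=0.5 each row contributes only 0.25 of hessian):

import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'X1': [1, 0, 0, 1], 'X2': [0, 1, 1, 0], 'Y': [0, 1, 1, 0]})
dtrain = xgb.DMatrix(df[['X1', 'X2']].values, df['Y'].values)

params = {'objective': 'binary:logistic',
          'lambda': 0,            # drop the L2 penalty on leaf weights
          'min_child_weight': 0}  # each 2-row leaf holds only 0.5 of hessian
clf = xgb.train(params, dtrain, num_boost_round=100)

test = xgb.DMatrix(pd.DataFrame({'X1': [1, 0], 'X2': [0, 1]}).values)
print(clf.predict(test))  # P(Y=1): close to 0 for (1,0), close to 1 for (0,1)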

answered Oct 06 '22 by T. Scharf