I am having problems running logistic regression with xgboost, which can be summarized with the following example.
Let's assume I have a very simple dataframe with two predictors and one target variable:
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'X1': pd.Series([1, 0, 0, 1]), 'X2': pd.Series([0, 1, 1, 0]), 'Y': pd.Series([0, 1, 1, 0])})
I can't post images because I'm new here, but we can clearly see that when X1 = 1 and X2 = 0, Y is 0, and when X1 = 0 and X2 = 1, Y is 1.
My idea is to build a model that outputs the probability that an observation belongs to each one of the classes, so if I run xgboost trying to predict two new observations (1,0) and (0,1) like so:
X = df[['X1', 'X2']].values
y = df['Y'].values
test = pd.DataFrame({'X1': pd.Series([1, 0]), 'X2': pd.Series([0, 1])})

params = {'objective': 'binary:logistic',
          'num_class': 2}

clf1 = xgb.train(params=params, dtrain=xgb.DMatrix(X, y), num_boost_round=100)
clf1.predict(xgb.DMatrix(test.values))
the output is:
array([[ 0.5, 0.5],
[ 0.5, 0.5]], dtype=float32)
which, I imagine, means that for the first observation there is a 50% chance of it belonging to each of the classes.
I'd like to know why the algorithm won't output a proper (1, 0), or something closer to that, if the relationship between the variables is clear.
FYI, I did try with more data (I'm only using 4 rows for simplicity) and the behavior is almost the same; what I do notice is that not only do the probabilities not sum to 1, they are often very small, like so (this result is from a different dataset, nothing to do with the example above):
array([[ 0.00356463, 0.00277259],
[ 0.00315137, 0.00268578],
[ 0.00453343, 0.00157113],
OK, here's what is happening.
The clue as to why it isn't working is that on such a small dataset it cannot train properly. I trained this exact model and, looking at the dump of all the trees, you will see that they cannot split.
(tree dump below)
NO SPLITS, they have been pruned!
[1] "booster[0]" "0:leaf=-0" "booster[1]" "0:leaf=-0" "booster[2]" "0:leaf=-0"
[7] "booster[3]" "0:leaf=-0" "booster[4]" "0:leaf=-0" "booster[5]" "0:leaf=-0"
[13] "booster[6]" "0:leaf=-0" "booster[7]" "0:leaf=-0" "booster[8]" "0:leaf=-0"
[19] "booster[9]" "0:leaf=-0"
There isn't enough weight in each of the leaves to overpower xgboost's internal regularization (which penalizes it for growing).
These parameters may or may not be accessible from the Python version, but you can grab them from R if you do a GitHub install:
http://xgboost.readthedocs.org/en/latest/parameter.html
lambda [default=1] L2 regularization term on weights
alpha [default=0] L1 regularization term on weights
Basically, this is why your example trains better as you add more data, but cannot train at all with only 4 examples and default settings.
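For what it's worth, here is a minimal sketch of how you might re-run the 4-row example from Python with the regularization pressure turned off. lambda and alpha can be passed through the params dict; I'm also zeroing min_child_weight (the minimum hessian sum allowed per leaf), which is an extra assumption on my part, since 4 rows leave very little weight in each leaf. I've dropped num_class and used plain binary:logistic, so predict returns one probability of class 1 per row:

import numpy as np
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'X1': [1, 0, 0, 1], 'X2': [0, 1, 1, 0], 'Y': [0, 1, 1, 0]})
dtrain = xgb.DMatrix(df[['X1', 'X2']].values, label=df['Y'].values)

# Relax the penalties that were pruning every tree down to a single leaf.
params = {'objective': 'binary:logistic',
          'lambda': 0,            # L2 regularization term on weights
          'alpha': 0,             # L1 regularization term on weights
          'min_child_weight': 0}  # assumed extra knob: min hessian sum per leaf

clf2 = xgb.train(params=params, dtrain=dtrain, num_boost_round=100)
test = xgb.DMatrix(np.array([[1, 0], [0, 1]]))
print(clf2.predict(test))  # should now be close to [0, 1] instead of [0.5, 0.5]

With those penalties out of the way the trees are actually allowed to split on X1 and X2, which is the same point as above: the defaults aren't wrong, there is just too little data to satisfy them.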