I am having problems running logistic regression with xgboost, which can be summarized with the following example.
Let's assume I have a very simple dataframe with two predictors and one target variable:
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'X1': pd.Series([1, 0, 0, 1]), 'X2': pd.Series([0, 1, 1, 0]), 'Y': pd.Series([0, 1, 1, 0])})
I can't post images because I'm new here, but we can clearly see that when X1 = 1 and X2 = 0, Y is 0, and when X1 = 0 and X2 = 1, Y is 1.
My idea is to build a model that outputs the probability that an observation belongs to each one of the classes, so if I run xgboost trying to predict two new observations (1,0) and (0,1) like so:
X = df[['X1', 'X2']].values
y = df['Y'].values
test = pd.DataFrame({'X1': pd.Series([1, 0]), 'X2': pd.Series([0, 1])})

params = {'objective': 'binary:logistic',
          'num_class': 2}

clf1 = xgb.train(params=params, dtrain=xgb.DMatrix(X, y), num_boost_round=100)
clf1.predict(xgb.DMatrix(test.values))
the output is:
array([[ 0.5, 0.5],
[ 0.5, 0.5]], dtype=float32)
which, I imagine, means that for the first observation there is a 50% chance of it belonging to each of the classes.
I'd like to know why the algorithm won't output a proper (1, 0), or something closer to that, if the relationship between the variables is clear.
FYI, I did try with more data (I'm only using 4 rows for simplicity) and the behavior is almost the same; what I do notice is that not only do the probabilities not sum to 1, they are often very small, like so (this result is from a different dataset, nothing to do with the example above):
array([[ 0.00356463, 0.00277259],
[ 0.00315137, 0.00268578],
[ 0.00453343, 0.00157113],
OK, here's what is happening.
The clue as to why it isn't working is that on such a small dataset it cannot train properly. I trained this exact model and, looking at the dump of all the trees, you will see that they cannot split.
(tree dump below)
NO SPLITS, they have been pruned!
[1] "booster[0]" "0:leaf=-0" "booster[1]" "0:leaf=-0" "booster[2]" "0:leaf=-0"
[7] "booster[3]" "0:leaf=-0" "booster[4]" "0:leaf=-0" "booster[5]" "0:leaf=-0"
[13] "booster[6]" "0:leaf=-0" "booster[7]" "0:leaf=-0" "booster[8]" "0:leaf=-0"
[19] "booster[9]" "0:leaf=-0"
There isn't enough weight in each of the leaves to overpower xgboost's internal regularization (which penalizes it for growing).
These parameters may or may not be accessible from the Python version, but you can grab them from R if you do a GitHub install:
http://xgboost.readthedocs.org/en/latest/parameter.html
lambda [default=1] L2 regularization term on weights
alpha [default=0] L1 regularization term on weights
Basically, this is why your example trains better as you add more data, but cannot train at all with only 4 examples and default settings.
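For what it's worth, here is a minimal sketch of how you might re-run the 4-row example from Python with the regularization pressure turned off. lambda and alpha can be passed through the params dict; I'm also zeroing min_child_weight (the minimum hessian sum allowed per leaf), which is an extra assumption on my part, since 4 rows leave very little weight in each leaf. I've dropped num_class and used plain binary:logistic, so predict returns one probability of class 1 per row:

import numpy as np
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({'X1': [1, 0, 0, 1], 'X2': [0, 1, 1, 0], 'Y': [0, 1, 1, 0]})
dtrain = xgb.DMatrix(df[['X1', 'X2']].values, label=df['Y'].values)

# Relax the penalties that were pruning every tree down to a single leaf.
params = {'objective': 'binary:logistic',
          'lambda': 0,            # L2 regularization term on weights
          'alpha': 0,             # L1 regularization term on weights
          'min_child_weight': 0}  # assumed extra knob: min hessian sum per leaf

clf2 = xgb.train(params=params, dtrain=dtrain, num_boost_round=100)
test = xgb.DMatrix(np.array([[1, 0], [0, 1]]))
print(clf2.predict(test))  # should now be close to [0, 1] instead of [0.5, 0.5]

With those penalties out of the way the trees are actually allowed to split on X1 and X2, which is the same point as above: the defaults aren't wrong, there is just too little data to satisfy them.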