Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python GLM converges but R GLM doesn't?

I am trying to train an (identity link Poisson) GLM model on some data. I built what I believe to be equivalent models in R and Python. In R, the GLM fails to converge. In Python, the GLM does converge. Can anyone explain why this is the case?

Starting from the observation (R code)

b <- c(1607.72888, 1457.32996, 2230.17041, 2371.61865, 2759.7373, 
2200.35571, 3700.06104, 75333.8906, 33347.0156, 24970.9219, 5840.49512, 
2465.53516, 1882.85522, 1027.7406, 796.484009, 863.635925, 926.715149, 
1315.60522, 7734.25879, 255625, 130990.281, 71047.9375, 23927.1992, 
15792.1631, 13041.4033, 6688.22559, 2822.23682, 2115.25049, 1618.78552, 
1517.86987, 482027.531, 213073.016, 98384.4844, 35729.2227, 21157.3594, 
16215.7432, 14006.8438, 4750.69092, 3116.08838, 1554.2738, 587.872559, 
0, 0, 64.625, 16.4270325, 37.5073662, 274.369263, 0)

and the predictor

M <- c(-7.9675869936053e-06, -2.9465510727696e-06, -2.21335646368051e-06, 
3.64491029094931e-05, 1.27785515404535e-05, 3.38522978187467e-05, 
0.00062461548108636, 0.0464058660868474, 0.0196678752619104, 
0.0153096807169331, 0.00349654677177638, 0.0015296157844073, 
0.00101697130553886, 0.00053606322761061, 0.00032545716927196, 
0.00015466321949303, 0.00037175222102242, 0.00032327446276574, 
0.00455841559776405, 0.169167191854014, 0.0871342536136015, 0.0473945835827427, 
0.0161442668339746, 0.010567700966009, 0.00948533330933113, 0.00421478996145066, 
0.00184265032175036, 0.00149379880911122, 0.00103486520709283, 
0.00093688509002056, 0.298346941396298, 0.133754666017589, 0.0633174899164175, 
0.0228254672410591, 0.0134052216553624, 0.0100745170940446, 0.00871104028108377, 
0.00284039230023698, 0.00189497902518704, 0.00080088467813307, 
1.5570018119882e-05, 5.27906279865597e-05, 5.36030668242488e-05, 
2.02817303022873e-05, 2.78819345726129e-05, 2.85829668669995e-05, 
1.83105158084491e-05, 4.3102208421767e-06)

or in python the observation

b = array([[1.60772888e+03],
       [1.45732996e+03],
       [2.23017041e+03],
       [2.37161865e+03],
       [2.75973730e+03],
       [2.20035571e+03],
       [3.70006104e+03],
       [7.53338906e+04],
       [3.33470156e+04],
       [2.49709219e+04],
       [5.84049512e+03],
       [2.46553516e+03],
       [1.88285522e+03],
       [1.02774060e+03],
       [7.96484009e+02],
       [8.63635925e+02],
       [9.26715149e+02],
       [1.31560522e+03],
       [7.73425879e+03],
       [2.55625000e+05],
       [1.30990281e+05],
       [7.10479375e+04],
       [2.39271992e+04],
       [1.57921631e+04],
       [1.30414033e+04],
       [6.68822559e+03],
       [2.82223682e+03],
       [2.11525049e+03],
       [1.61878552e+03],
       [1.51786987e+03],
       [4.82027531e+05],
       [2.13073016e+05],
       [9.83844844e+04],
       [3.57292227e+04],
       [2.11573594e+04],
       [1.62157432e+04],
       [1.40068438e+04],
       [4.75069092e+03],
       [3.11608838e+03],
       [1.55427380e+03],
       [5.87872559e+02],
       [0.00000000e+00],
       [0.00000000e+00],
       [6.46250000e+01],
       [1.64270325e+01],
       [3.75073662e+01],
       [2.74369263e+02],
       [0.00000000e+00]])

and the predictor

M = array([[-7.96758699e-06],
       [-2.94655107e-06],
       [-2.21335646e-06],
       [ 3.64491029e-05],
       [ 1.27785515e-05],
       [ 3.38522978e-05],
       [ 6.24615481e-04],
       [ 4.64058661e-02],
       [ 1.96678753e-02],
       [ 1.53096807e-02],
       [ 3.49654677e-03],
       [ 1.52961578e-03],
       [ 1.01697131e-03],
       [ 5.36063228e-04],
       [ 3.25457169e-04],
       [ 1.54663219e-04],
       [ 3.71752221e-04],
       [ 3.23274463e-04],
       [ 4.55841560e-03],
       [ 1.69167192e-01],
       [ 8.71342536e-02],
       [ 4.73945836e-02],
       [ 1.61442668e-02],
       [ 1.05677010e-02],
       [ 9.48533331e-03],
       [ 4.21478996e-03],
       [ 1.84265032e-03],
       [ 1.49379881e-03],
       [ 1.03486521e-03],
       [ 9.36885090e-04],
       [ 2.98346941e-01],
       [ 1.33754666e-01],
       [ 6.33174899e-02],
       [ 2.28254672e-02],
       [ 1.34052217e-02],
       [ 1.00745171e-02],
       [ 8.71104028e-03],
       [ 2.84039230e-03],
       [ 1.89497903e-03],
       [ 8.00884678e-04],
       [ 1.55700181e-05],
       [ 5.27906280e-05],
       [ 5.36030668e-05],
       [ 2.02817303e-05],
       [ 2.78819346e-05],
       [ 2.85829669e-05],
       [ 1.83105158e-05],
       [ 4.31022084e-06]])

I first tried to construct the glm using R's stats library with:


glm(b ~ 0 + M,
         family = poisson(identity)))

This fails to converge, with the error:

Error: no valid set of coefficients has been found: please supply starting values In addition: Warning message: In log(y/mu) : NaNs produced

I then built the following (I believe equivalent) python model:

poiReg = sm.GLM(b, M, family = 
       sm.families.Poisson(link = 
                           sm.families.links.identity())).fit() 

and this model converged to 1578382.6635341388

like image 561
librus Avatar asked May 26 '26 02:05

librus


1 Answers

I have had almost the same problem as described in the question. A client wanted to use the identity link for a Poisson GLM because they thought it had an easier interpretation.

Other than the potential to predict negative counts, and as mentioned in Tom Wenseleers' comment to the OP:

Identity link Poisson GLMs are badly supported in R, because the algo currently doesn't enforce nonnegativity constraints on the fitted coefficients, which is required for identity link Poisson to converge.

...for me, changing to the log link solved the problem.

like image 115
Robert Long Avatar answered May 28 '26 15:05

Robert Long