Fitting a Poisson distribution to data in statsmodels

Tags:

statsmodels

I am trying to fit a Poisson distribution to my data using statsmodels but I am confused by the results that I am getting and how to use the library.

My real data will be a series of numbers that I think I should be able to describe as a Poisson distribution plus some outliers, so eventually I would like to do a robust fit to the data.

However for testing purposes, I just create a dataset using scipy.stats.poisson

samp = scipy.stats.poisson.rvs(4,size=200)

So to fit this using statsmodels, I think that I just need to have a constant 'exog'

res = sm.Poisson(samp,np.ones_like(samp)).fit()

print(res.summary())

                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  200
Model:                        Poisson   Df Residuals:                      199
Method:                           MLE   Df Model:                            0
Date:                Fri, 27 Jun 2014   Pseudo R-squ.:                   0.000
Time:                        14:28:29   Log-Likelihood:                -404.37
converged:                       True   LL-Null:                       -404.37
                                        LLR p-value:                       nan
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.3938      0.035     39.569      0.000         1.325     1.463
==============================================================================

OK, that doesn't look right. But if I do

res.predict()

I get an array of 4.03 (which was the mean for this test sample). So firstly, I am very confused about how to interpret this result from statsmodels. Secondly, I should probably be doing something completely different if I'm interested in robust parameter estimation of a distribution rather than fitting trends, but how should I go about doing that?

Edit: I should really have given more detail in order to get an answer to the second part of my question.

I have an event that occurs a random time after a starting time. When I plot a histogram of the delay times for many events, I see that the distribution looks like a scaled Poisson distribution plus several outlier points which are normally caused by issues in my underlying system. So I simply wanted to find the expected time delay for the dataset, excluding the outliers. If not for the outliers, I could simply find the mean time. I suppose that I could exclude them manually but I thought that I could find something more exacting.

Edit: On further reflection, I will be considering other distributions instead of sticking with a Poisson, and the details of my issue are probably a distraction from the original question, but I've left them here anyway.


asked Jun 27 '14 13:06

robochat

1 Answer

The Poisson model, like most other models in the generalized linear model families or for other discrete data, assumes that we have a transformation that bounds the prediction to the appropriate range.

Poisson works for nonnegative numbers and the transformation is exp, so the model that is estimated assumes that the expected value of an observation, conditional on the explanatory variables, is

 E(y | x) = exp(X dot params)

To get the lambda parameter of the Poisson distribution, we need to use exp, i.e.

>>> np.exp(1.3938)
4.0301355071650118

predict does this by default, but you can request just the linear part (X dot params) with a keyword argument.

BTW: in statsmodels' controversial terminology, endog is y and exog is x (it has an x in it); see http://statsmodels.sourceforge.net/devel/endog_exog.html

Outlier Robust Estimation

The answer to the last part of the question is that there is currently no outlier robust estimation in Python for Poisson or other count models, as far as I know.

For overdispersed data, where the variance is larger than the mean, we can use NegativeBinomial Regression. For outliers in Poisson we would have to use R/Rpy or do manual trimming of outliers. Outlier identification could be based on one of the standardized residuals.
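The manual trimming mentioned above could look roughly like the sketch below. It is an ad hoc illustration, not an established robust estimator: the function name `trimmed_poisson_rate`, the cutoff of 3 on the Pearson residual, the median starting value, the 0.5 floor on the rate, and the injected outliers are all assumptions made here. Trimming the upper tail also biases the estimate slightly downward.

```python
import numpy as np

rng = np.random.default_rng(1)                # seed only for reproducibility
clean = rng.poisson(lam=4, size=200)          # well-behaved counts
outliers = np.array([40, 55, 60])             # hypothetical contamination
y = np.concatenate([clean, outliers])

def trimmed_poisson_rate(y, cutoff=3.0, max_iter=20):
    """Estimate a Poisson rate, dropping points with large Pearson residuals."""
    y = np.asarray(y, dtype=float)
    lam = max(np.median(y), 0.5)              # robust start; floor avoids dividing by zero
    keep = np.ones(y.size, dtype=bool)
    for _ in range(max_iter):
        pearson = (y - lam) / np.sqrt(lam)    # Poisson variance equals the mean
        new_keep = np.abs(pearson) <= cutoff
        if not new_keep.any():                # nothing left to average; stop
            break
        new_lam = max(y[new_keep].mean(), 0.5)
        converged = np.array_equal(new_keep, keep) and np.isclose(new_lam, lam)
        keep, lam = new_keep, new_lam
        if converged:
            break
    return lam, keep

lam_trimmed, keep = trimmed_poisson_rate(y)
print(y.mean(), lam_trimmed, (~keep).sum())   # naive mean, trimmed rate, points dropped
```

Starting from the median rather than the mean keeps gross outliers from inflating the first rate estimate and thereby masking themselves.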

Outlier-robust estimation will not be available in statsmodels for some time, unless someone contributes it.

answered Nov 14 '22 20:11

Josef
