
Fitting a Poisson distribution to data in statsmodels

I am trying to fit a Poisson distribution to my data using statsmodels but I am confused by the results that I am getting and how to use the library.

My real data will be a series of numbers that I think can be described as having a Poisson distribution plus some outliers, so eventually I would like to do a robust fit to the data.

However, for testing purposes, I just create a dataset using scipy.stats.poisson:

import scipy.stats
samp = scipy.stats.poisson.rvs(4, size=200)

So to fit this using statsmodels I think that I just need to have a constant 'endog'

import numpy as np
import statsmodels.api as sm

res = sm.Poisson(samp, np.ones_like(samp)).fit()

print(res.summary())

                          Poisson Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  200
Model:                        Poisson   Df Residuals:                      199
Method:                           MLE   Df Model:                            0
Date:                Fri, 27 Jun 2014   Pseudo R-squ.:                   0.000
Time:                        14:28:29   Log-Likelihood:                -404.37
converged:                       True   LL-Null:                       -404.37
                                        LLR p-value:                       nan
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.3938      0.035     39.569      0.000         1.325     1.463
==============================================================================

OK, that doesn't look right. But if I do

res.predict()

I get an array of 4.03 (which was the mean for this test sample). So, firstly, I am very confused about how to interpret this result from statsmodels. Secondly, I should probably be doing something completely different if I'm interested in robust parameter estimation of a distribution rather than fitting trends, but how should I go about doing that?

Edit: I should really have given more detail so that the second part of my question could be answered.

I have an event that occurs a random time after a starting time. When I plot a histogram of the delay times for many events, I see that the distribution looks like a scaled Poisson distribution plus several outlier points which are normally caused by issues in my underlying system. So I simply wanted to find the expected time delay for the dataset, excluding the outliers. If not for the outliers, I could simply find the mean time. I suppose that I could exclude them manually but I thought that I could find something more exacting.

Edit: On further reflection, I will be considering other distributions instead of sticking with a Poisson, and the details of my issue are probably a distraction from the original question, but I've left them here anyway.

robochat asked Jun 27 '14 13:06




1 Answer

The Poisson model, like most other models in the generalized linear model family and other models for discrete data, uses a transformation (the inverse link function) that keeps the prediction in the appropriate range.

Poisson works for nonnegative numbers and the transformation is exp, so the estimated model assumes that the expected value of an observation, conditional on the explanatory variables, is

 E(y | x) = exp(X dot params)

To get the lambda parameter of the Poisson distribution, we need to apply exp to the estimated coefficient, i.e.

>>> np.exp(1.3938)
4.0301355071650118

predict does this by default, but you can request just the linear part (X dot params) with a keyword argument.
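To see why the fitted constant is the log of the sample mean, here is a minimal sketch in plain Python (not statsmodels' own code) that maximizes the constant-only Poisson log-likelihood with Newton's method. The helper name fit_constant_poisson and the stand-in data are made up for illustration; the data only reproduce the sample size (200) and mean (4.03) from the question, which is all that enters the estimate.

```python
import math


def fit_constant_poisson(counts, tol=1e-12, max_iter=100):
    """Newton's method for beta in E(y) = exp(beta), constant-only model.

    Returns (beta, standard_error). Hypothetical helper for illustration;
    starting from beta = 0 is only sensible for moderate means.
    """
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive count")
    beta = 0.0
    for _ in range(max_iter):
        lam = math.exp(beta)
        score = total - n * lam   # first derivative of the log-likelihood
        hessian = -n * lam        # second derivative (always negative)
        step = score / hessian
        beta -= step
        if abs(step) < tol:
            break
    lam = math.exp(beta)
    std_err = 1.0 / math.sqrt(n * lam)  # square root of the inverse of -hessian
    return beta, std_err


# Stand-in data with the same size and mean as the question's sample.
counts = [4] * 194 + [5] * 6  # 200 values, mean 4.03
beta, std_err = fit_constant_poisson(counts)
print(round(beta, 4), round(std_err, 3), round(math.exp(beta), 2))
```

Setting the score to zero gives exp(beta) = mean, so the coefficient should round to the 1.3938 in the summary table, and the standard error 1/sqrt(806) to the 0.035 reported there.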

BTW: in statsmodels' controversial terminology, endog is y and exog is x (it has an x in it); see http://statsmodels.sourceforge.net/devel/endog_exog.html

Outlier Robust Estimation

The answer to the last part of the question is that there is currently no outlier robust estimation in Python for Poisson or other count models, as far as I know.

For overdispersed data, where the variance is larger than the mean, we can use NegativeBinomial Regression. For outliers in Poisson we would have to use R/Rpy or do manual trimming of outliers. Outlier identification could be based on one of the standardized residuals.
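A quick way to check the variance-versus-mean condition is the ratio of the sample variance to the sample mean, which should be close to 1 for Poisson data. The sketch below uses only the standard library; the function name dispersion_index and both data sets are made up for illustration.

```python
import statistics


def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 for Poisson data."""
    return statistics.variance(counts) / statistics.mean(counts)


clean = [3, 5, 4, 2, 6, 4, 4, 3, 5, 4]  # made-up counts
with_outliers = clean + [25, 40]        # the same counts plus two made-up outliers
print(dispersion_index(clean), dispersion_index(with_outliers))
```

Note that a few outliers inflate this ratio just as genuine overdispersion does, so a large value on its own does not say which of the two is present.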

Robust estimation for count models will not be available in statsmodels for some time, unless someone contributes it.
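In the meantime, manual trimming based on standardized residuals could look roughly like the sketch below. It is a heuristic, not a proper robust estimator: it repeatedly fits the mean, computes Pearson residuals (y - mean) / sqrt(mean), and drops points beyond a cutoff. The function name trimmed_poisson_mean, the cutoff of 3 and the data are all assumptions chosen for illustration.

```python
import math


def trimmed_poisson_mean(counts, threshold=3.0, max_rounds=20):
    """Estimate a Poisson mean after iteratively dropping large residuals.

    Returns (mean_of_kept_points, kept_points).
    """
    kept = list(counts)
    if not kept:
        raise ValueError("need at least one observation")
    for _ in range(max_rounds):
        lam = sum(kept) / len(kept)
        if lam == 0:
            break
        scale = math.sqrt(lam)  # Poisson: the variance equals the mean
        inliers = [y for y in kept if abs(y - lam) / scale <= threshold]
        # Stop when nothing is dropped, or when everything would be dropped.
        if not inliers or len(inliers) == len(kept):
            break
        kept = inliers
    return sum(kept) / len(kept), kept


# Made-up counts around 4 plus two obvious outliers.
delays = [3, 5, 4, 2, 6, 4, 4, 3, 5, 4, 25, 40]
lam, kept = trimmed_poisson_mean(delays)
print(lam, len(kept))
```

Repeating the fit matters because large outliers inflate the first estimate of the mean and can mask smaller ones; the cutoff should be tuned to the data at hand.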

Josef answered Nov 14 '22 20:11