Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Poisson Regression in statsmodels and R

Given the some randomly generated data with

  • 2 columns,
  • 50 rows and
  • integer range between 0-100

With R, the poisson glm and diagnostics plot can be achieved as such:

> col=2
> row=50
> range=0:100
> df <- data.frame(replicate(col,sample(range,row,rep=TRUE)))
> model <- glm(X2 ~ X1, data = df, family = poisson)
> glm.diag.plots(model)

In Python, this would give me the line predictor vs residual plot:

import numpy as np
import pandas as pd
import statsmodels.formula.api
from statsmodels.genmod.families import Poisson
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randint(100, size=(50,2)))
df.rename(columns={0:'X1', 1:'X2'}, inplace=True)
glm = statsmodels.formula.api.gee
model = glm("X2 ~ X1", groups=None, data=df, family=Poisson())
results = model.fit()

And to plot the diagnostics in Python:

model_fitted_y = results.fittedvalues  # fitted values (need a constant term for intercept)
model_residuals = results.resid # model residuals
model_abs_resid = np.abs(model_residuals)  # absolute residuals


plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)
plot_lm_1.axes[0] = sns.residplot(model_fitted_y, 'X2', data=df, lowess=True, scatter_kws={'alpha': 0.5}, line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_xlabel('Line Predictor')
plot_lm_1.axes[0].set_ylabel('Residuals')
plt.show()

But when I try to get the cook statistics,

# cook's distance, from statsmodels internals
model_cooks = results.get_influence().cooks_distance[0]

it threw an error saying:

AttributeError                            Traceback (most recent call last)
<ipython-input-66-0f2bedfa1741> in <module>()
      4 model_residuals = results.resid
      5 # normalized residuals
----> 6 model_norm_residuals = results.get_influence().resid_studentized_internal
      7 # absolute squared normalized residuals
      8 model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

/opt/conda/lib/python3.6/site-packages/statsmodels/base/wrapper.py in __getattribute__(self, attr)
     33             pass
     34 
---> 35         obj = getattr(results, attr)
     36         data = results.model.data
     37         how = self._wrap_attrs.get(attr)

AttributeError: 'GEEResults' object has no attribute 'get_influence'

Is there a way to plot out all 4 diagnostic plots in Python like in R?

How do I retrieve the cook statistics of the fitted model results in Python using statsmodels?

like image 899
alvas Avatar asked Dec 07 '17 01:12

alvas


People also ask

How do you visualize a Poisson regression?

The easiest way to visualize it is to just take two different values for xi, e.g., high complexity and low complexity, and then plot the predicted frequency of yi=0,yi=1 etc. for both complexity levels.

When should I use Poisson regression?

Poisson Regression models are best used for modeling events where the outcomes are counts. Or, more specifically, count data: discrete data with non-negative integer values that count something, like the number of times an event occurs during a given timeframe or the number of people in line at the grocery store.

What does Poisson regression show?

Poisson regression is used to model response variables (Y-values) that are counts. It tells you which explanatory variables have a statistically significant effect on the response variable. In other words, it tells you which X-values work on the Y-value.


1 Answers

The generalized estimating equations API should give you a different result than R's GLM model estimation. To get similar estimates in statsmodels, you need to use something like:

import pandas as pd
import statsmodels.api as sm

# Read data generated in R using pandas or something similar
df = pd.read_csv(...) # file name goes here

# Add a column of ones for the intercept to create input X
X = np.column_stack( (np.ones((df.shape[0], 1)), df.X1) )

# Relabel dependent variable as y (standard notation)
y = df.X2

# Fit GLM in statsmodels using Poisson link function
sm.GLM(y, X, family = Poisson()).fit().summary()

EDIT -- Here is the rest of the answer on how to get Cook's distance in Poisson regression. This is a script I wrote based on some data generated in R. I compared my values against those in R calculated using the cooks.distance function and the values matched.

from __future__ import division, print_function

import numpy as np
import pandas as pd
import statsmodels.api as sm

PATH = '/Users/robertmilletich/test_reg.csv'


def _weight_matrix(fitted_model):
    """Calculates weight matrix in Poisson regression

    Parameters
    ----------
    fitted_model : statsmodel object
        Fitted Poisson model

    Returns
    -------
    W : 2d array-like
        Diagonal weight matrix in Poisson regression
    """
    return np.diag(fitted_model.fittedvalues)


def _hessian(X, W):
    """Hessian matrix calculated as -X'*W*X

    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates

    W : 2d array-like
        Weight matrix

    Returns
    -------
    hessian : 2d array-like
        Hessian matrix
    """
    return -np.dot(X.T, np.dot(W, X))


def _hat_matrix(X, W):
    """Calculate hat matrix = W^(1/2) * X * (X'*W*X)^(-1) * X'*W^(1/2)

    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates

    W : 2d array-like
        Diagonal weight matrix

    Returns
    -------
    hat : 2d array-like
        Hat matrix
    """
    # W^(1/2)
    Wsqrt = W**(0.5)

    # (X'*W*X)^(-1)
    XtWX     = -_hessian(X = X, W = W)
    XtWX_inv = np.linalg.inv(XtWX)

    # W^(1/2)*X
    WsqrtX = np.dot(Wsqrt, X)

    # X'*W^(1/2)
    XtWsqrt = np.dot(X.T, Wsqrt)

    return np.dot(WsqrtX, np.dot(XtWX_inv, XtWsqrt))


def main():

    # Load data and separate into X and y
    df = pd.read_csv(PATH)
    X  = np.column_stack( (np.ones((df.shape[0], 1)), df.X1 ) )
    y  = df.X2

    # Fit model
    model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

    # Weight matrix
    W = _weight_matrix(model)

    # Hat matrix
    H   = _hat_matrix(X, W)
    hii = np.diag(H) # Diagonal values of hat matrix

    # Pearson residuals
    r = model.resid_pearson

    # Cook's distance (formula used by R = (res/(1 - hat))^2 * hat/(dispersion * p))
    # Note: dispersion is 1 since we aren't modeling overdispersion
    cooks_d = (r/(1 - hii))**2 * hii/(1*2)

if __name__ == "__main__":
    main()
like image 52
rmilletich Avatar answered Nov 01 '22 03:11

rmilletich