Statsmodels - broadcast shapes different?

I am trying to fit a logistic regression to a dataset with a boolean target variable ('default') and two features ('fico_interp', 'home_ownership_int'), using logit from the statsmodels formula API. All three columns come from the same data frame, 'traindf':

from sklearn import datasets
import statsmodels.formula.api as smf

lmf = smf.logit('default ~ fico_interp + home_ownership_int',traindf).fit()

Which generates an error message:

ValueError: operands could not be broadcast together with shapes (40406,2) (40406,)

How can this happen?

asked May 18 '15 23:05 by GPB


1 Answer

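The problem is that traindf['default'] contains values that are not numeric (booleans count as non-numeric here, just like strings). The formula interface treats a non-numeric target as categorical and expands it into one indicator column per level, which is most likely where the two-column array of shape (40406,2) in the error message comes from.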

The following code reproduces the error:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
# Store the 0/1 target as strings, so it is no longer numeric
df['C'] = ((df['B'] > 0) * 1).apply(str)
lmf = smf.logit('C ~ A', df).fit()  # raises ValueError
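
To see where the extra column comes from, the design matrices can be built directly. This is only a sketch, assuming the formula is parsed by patsy (the formula backend statsmodels relied on when this was written); the seeded generator and the names y and X are my own choices for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

# Same toy data as above, with a fixed seed (an assumption, for repeatability)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 2)), columns=list("AB"))
df["C"] = (df["B"] > 0).astype(int).astype(str)  # '0' / '1' stored as strings

# Build the left-hand side (y) and right-hand side (X) of the formula
y, X = dmatrices("C ~ A", df)

print(df["C"].dtype)                # non-numeric dtype
print(y.shape, X.shape)             # y has one column per level of C
print(y.design_info.column_names)   # names of the indicator columns
```

If y comes back with two columns rather than one, the target is being treated as categorical, and that is the array that cannot be broadcast against the fitted values.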

And the following code is a possible way to fix this instance:

df.replace(to_replace={'C' : {'1': 1, '0': 0}}, inplace = True)
lmf = smf.logit('C ~ A', df).fit()
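
An alternative that does not depend on how replace infers the resulting dtype is to cast the target explicitly. The sketch below covers both a string-valued and a boolean target (the boolean case matches the original question); the column names C_str, C_bool, C_from_str and C_from_bool are hypothetical and only used for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with a fixed seed (an assumption, for repeatability)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 2)), columns=list("AB"))

# Two non-numeric encodings of the same binary target
df["C_str"] = (df["B"] > 0).astype(int).astype(str)  # '0' / '1' strings
df["C_bool"] = df["B"] > 0                           # True / False

# Cast each one to a plain integer 0/1 column
df["C_from_str"] = df["C_str"].astype(int)
df["C_from_bool"] = df["C_bool"].astype(int)

print(df.dtypes)
```

Either integer column can then be used as the target, for example smf.logit('C_from_bool ~ A', df).fit(). For the data in the question, the equivalent would presumably be traindf['default'] = traindf['default'].astype(int) before fitting.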

This post reports an analogous issue.

answered Sep 20 '22 00:09 by freeseek