I am having a problem performing Multiple Regression on a dataset containing around 7500 data points with missing data (NaN) in some columns and rows. There is at least one NaN value in each row. Some rows contain only NaN values.
I am using OLS from statsmodels for the regression analysis. I'm trying not to use scikit-learn for the OLS regression because (I might be wrong about this) I'd have to impute the missing data in my dataset, which would distort it to some extent.
My dataset looks like this: KPI
This is what I did (the target variable is KPI6; the predictor variables are all the remaining variables):
est2 = ols(formula = KPI.KPI6.name + ' ~ ' + ' + '.join(KPI.drop('KPI6', axis = 1).columns.tolist()), data = KPI).fit()
And it returns a ValueError: zero-size array to reduction operation maximum which has no identity.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-207-b24ba316a452> in <module>()
3 #test = KPI.dropna(how='all')
4 #test = KPI.fillna(0)
----> 5 est2 = ols(formula = KPI.KPI6.name + ' ~ ' + ' + '.join(KPI.drop('KPI6', axis = 1).columns.tolist()), data = KPI).fit()
6 print(est2.summary())
/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
172 'formula': formula, # attach formula for unpckling
173 'design_info': design_info})
--> 174 mod = cls(endog, exog, *args, **kwargs)
175 mod.formula = formula
176
/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py in __init__(self, endog, exog, missing, hasconst, **kwargs)
629 **kwargs):
630 super(OLS, self).__init__(endog, exog, missing=missing,
--> 631 hasconst=hasconst, **kwargs)
632 if "weights" in self._init_keys:
633 self._init_keys.remove("weights")
/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py in __init__(self, endog, exog, weights, missing, hasconst, **kwargs)
524 weights = weights.squeeze()
525 super(WLS, self).__init__(endog, exog, missing=missing,
--> 526 weights=weights, hasconst=hasconst, **kwargs)
527 nobs = self.exog.shape[0]
528 weights = self.weights
/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py in __init__(self, endog, exog, **kwargs)
93 """
94 def __init__(self, endog, exog, **kwargs):
---> 95 super(RegressionModel, self).__init__(endog, exog, **kwargs)
96 self._data_attr.extend(['pinv_wexog', 'wendog', 'wexog', 'weights'])
97
/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in __init__(self, endog, exog, **kwargs)
210
211 def __init__(self, endog, exog=None, **kwargs):
--> 212 super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
213 self.initialize()
214
/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in __init__(self, endog, exog, **kwargs)
61 hasconst = kwargs.pop('hasconst', None)
62 self.data = self._handle_data(endog, exog, missing, hasconst,
---> 63 **kwargs)
64 self.k_constant = self.data.k_constant
65 self.exog = self.data.exog
/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in _handle_data(self, endog, exog, missing, hasconst, **kwargs)
86
87 def _handle_data(self, endog, exog, missing, hasconst, **kwargs):
---> 88 data = handle_data(endog, exog, missing, hasconst, **kwargs)
89 # kwargs arrays could have changed, easier to just attach here
90 for key in kwargs:
/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/data.py in handle_data(endog, exog, missing, hasconst, **kwargs)
628 klass = handle_data_class_factory(endog, exog)
629 return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
--> 630 **kwargs)
/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/data.py in __init__(self, endog, exog, missing, hasconst, **kwargs)
77
78 # this has side-effects, attaches k_constant and const_idx
---> 79 self._handle_constant(hasconst)
80 self._check_integrity()
81 self._cache = resettable_cache()
/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/data.py in _handle_constant(self, hasconst)
129 # detect where the constant is
130 check_implicit = False
--> 131 const_idx = np.where(self.exog.ptp(axis=0) == 0)[0].squeeze()
132 self.k_constant = const_idx.size
133
ValueError: zero-size array to reduction operation maximum which has no identity
I suspected that the error arose because the target variable (i.e. KPI6) contains some NaNs, so I tried dropping all rows where KPI6 is NaN like this, but the problem persists:
KPI.dropna(subset = ['KPI6'])
I also tried dropping all rows that contain only NaN values but the problem still persists:
KPI.dropna(how = 'all')
I combined both steps above and the problem still persists. The only way to eliminate this error is to actually impute the missing data with something (e.g. 0, mean, median, etc.). However, I'm hoping to avoid this as much as possible, because I want to perform OLS regression on the original data.
OLS regression also works when I select only a few variables as predictors, but again this is not what I aim to do. I want to include all variables other than KPI6 as predictors.
Is there any solution to this? I've been really stressed out over this for one week. Any help is appreciated. I'm not a pro Python coder so I'd appreciate it if you can break down the problem (& suggest a solution) in layman's terms.
Thanks so much in advance.
The OLS() function of the statsmodels.api module is used to perform OLS regression. It returns an OLS object. The fit() method is then called on this object to fit the regression line to the data.
statsmodels.formula.api: A convenience interface for specifying models using formula strings and DataFrames. This API directly exposes the from_formula class method of models that support the formula API.
I will point out two issues here. The condition number (abbreviated “Cond. No.” in the summary) is a measure of how close to singular a matrix is: the higher it is, the closer the matrix is to singular (infinite means singular, i.e. noninvertible), and the less reliable the best-fit approximation becomes.
The default missing-data handling when using formulas is to drop any row that contains at least one NaN. If every row contains a NaN, then there are no observations left. I think that's what the ValueError: zero-size array at the end of the traceback means.
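To see this effect without running a regression, you can count how many rows survive once every row with a NaN is dropped. The sketch below uses a tiny made-up frame (the values and the reduced set of `KPI` columns are assumptions standing in for the real data) in which every row has at least one NaN, as in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the KPI frame: every row has at least one NaN,
# and the last row is entirely NaN.
kpi = pd.DataFrame({
    "KPI1": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan],
    "KPI2": [2.0, np.nan, 3.0, 5.0, 6.0, np.nan],
    "KPI3": [np.nan, np.nan, np.nan, np.nan, 1.0, np.nan],
    "KPI6": [1.5, 2.5, 3.5, np.nan, np.nan, np.nan],
})

# Rows left after dropping every row that has any NaN
# (this mirrors what the formula interface does by default).
n_complete = len(kpi.dropna(how="any"))

# Number of non-missing values per column; sparse columns are the ones
# that wipe out the most rows when used as predictors.
coverage = kpi.notna().sum()

# With fewer columns, some rows can be complete again.
subset_cols = ["KPI1", "KPI6"]
n_complete_subset = len(kpi[subset_cols].dropna(how="any"))

print("complete rows, all columns:", n_complete)
print("complete rows, subset:", n_complete_subset)
print(coverage)
```

This is also why the regression works with only a few predictors: a smaller set of columns leaves some rows with no NaN at all. Note that `dropna` returns a new DataFrame rather than modifying `KPI` in place, so its result has to be assigned to a variable and passed to `ols` for it to have any effect.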
If you have enough data overall, then you can try imputing and estimating with MICE which will impute iteratively the missing values for each variable.