Multiple OLS Regression with Statsmodel ValueError: zero-size array to reduction operation maximum which has no identity

I am having a problem performing Multiple Regression on a dataset containing around 7500 data points with missing data (NaN) in some columns and rows. There is at least one NaN value in each row. Some rows contain only NaN values.

I am using statsmodels OLS for the regression analysis. I'm trying not to use scikit-learn for the OLS regression because (I might be wrong about this) I'd have to impute the missing data in my dataset, which would distort the dataset to a certain extent.

My dataset is stored in a pandas DataFrame named KPI.

This is what I did (the target variable is KPI6, the predictor variables are all remaining columns):

from statsmodels.formula.api import ols
est2 = ols(formula = KPI.KPI6.name + ' ~ ' + ' + '.join(KPI.drop('KPI6', axis = 1).columns.tolist()), data = KPI).fit()

It raises ValueError: zero-size array to reduction operation maximum which has no identity.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-207-b24ba316a452> in <module>()
      3 #test = KPI.dropna(how='all')
      4 #test = KPI.fillna(0)
----> 5 est2 = ols(formula = KPI.KPI6.name + ' ~ ' + ' + '.join(KPI.drop('KPI6', axis = 1).columns.tolist()), data = KPI).fit()
      6 print(est2.summary())

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
    172                        'formula': formula,  # attach formula for unpckling
    173                        'design_info': design_info})
--> 174         mod = cls(endog, exog, *args, **kwargs)
    175         mod.formula = formula
    176 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py in __init__(self, endog, exog, missing, hasconst, **kwargs)
    629                  **kwargs):
    630         super(OLS, self).__init__(endog, exog, missing=missing,
--> 631                                   hasconst=hasconst, **kwargs)
    632         if "weights" in self._init_keys:
    633             self._init_keys.remove("weights")

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py in __init__(self, endog, exog, weights, missing, hasconst, **kwargs)
    524             weights = weights.squeeze()
    525         super(WLS, self).__init__(endog, exog, missing=missing,
--> 526                                   weights=weights, hasconst=hasconst, **kwargs)
    527         nobs = self.exog.shape[0]
    528         weights = self.weights

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/regression/linear_model.py in __init__(self, endog, exog, **kwargs)
     93     """
     94     def __init__(self, endog, exog, **kwargs):
---> 95         super(RegressionModel, self).__init__(endog, exog, **kwargs)
     96         self._data_attr.extend(['pinv_wexog', 'wendog', 'wexog', 'weights'])
     97 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in __init__(self, endog, exog, **kwargs)
    210 
    211     def __init__(self, endog, exog=None, **kwargs):
--> 212         super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
    213         self.initialize()
    214 

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in __init__(self, endog, exog, **kwargs)
     61         hasconst = kwargs.pop('hasconst', None)
     62         self.data = self._handle_data(endog, exog, missing, hasconst,
---> 63                                       **kwargs)
     64         self.k_constant = self.data.k_constant
     65         self.exog = self.data.exog

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/model.py in _handle_data(self, endog, exog, missing, hasconst, **kwargs)
     86 
     87     def _handle_data(self, endog, exog, missing, hasconst, **kwargs):
---> 88         data = handle_data(endog, exog, missing, hasconst, **kwargs)
     89         # kwargs arrays could have changed, easier to just attach here
     90         for key in kwargs:

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/data.py in handle_data(endog, exog, missing, hasconst, **kwargs)
    628     klass = handle_data_class_factory(endog, exog)
    629     return klass(endog, exog=exog, missing=missing, hasconst=hasconst,
--> 630                  **kwargs)

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/data.py in __init__(self, endog, exog, missing, hasconst, **kwargs)
     77 
     78         # this has side-effects, attaches k_constant and const_idx
---> 79         self._handle_constant(hasconst)
     80         self._check_integrity()
     81         self._cache = resettable_cache()

/Users/anhtran/anaconda/lib/python3.6/site-packages/statsmodels/base/data.py in _handle_constant(self, hasconst)
    129             # detect where the constant is
    130             check_implicit = False
--> 131             const_idx = np.where(self.exog.ptp(axis=0) == 0)[0].squeeze()
    132             self.k_constant = const_idx.size
    133 

ValueError: zero-size array to reduction operation maximum which has no identity

I suspected that the error arose because the target variable (i.e. KPI6) contains some NaNs, so I tried dropping all rows where KPI6 is NaN, like this, but the problem persists:

KPI.dropna(subset = ['KPI6'])

I also tried dropping all rows that contain only NaN values but the problem still persists:

KPI.dropna(how = 'all')

I combined both steps above and the problem still persists. The only way to eliminate this error is to actually impute the missing data with something (e.g. 0, the mean, the median, etc.). However, I'm hoping to avoid this as much as possible, because I want to perform OLS regression on the original data.

OLS regression also works when I select only a few variables as predictors, but that is not what I am aiming for. I want to include every variable other than KPI6 as a predictor.

Is there any solution to this? I've been really stressed out over this for one week. Any help is appreciated. I'm not a pro Python coder so I'd appreciate it if you can break down the problem (& suggest a solution) in layman's terms.

Thanks so much in advance.

asked Aug 07 '17 11:08

Hanazono Sakura

1 Answer

The default missing-value handling when using formulas is to drop every row that contains at least one NaN. If each row contains a NaN, then there are no observations left and the model receives an empty array. I think that's what the last line of the traceback, ValueError: zero-size array, means.
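A quick way to check this on your own data is to count how many rows survive after dropping incomplete ones. The sketch below uses a small made-up DataFrame as a stand-in for KPI (the column names and values are hypothetical), so treat it as an illustration rather than a reproduction of your dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the KPI DataFrame: every row has at least one NaN,
# and the last row is entirely NaN.
KPI = pd.DataFrame({
    "KPI1": [1.0, 2.0, np.nan, 4.0, np.nan],
    "KPI2": [np.nan, 1.5, 2.5, np.nan, np.nan],
    "KPI3": [0.3, np.nan, 0.7, 0.9, np.nan],
    "KPI6": [10.0, 12.0, 14.0, np.nan, np.nan],
})

# Rows that are complete across all columns. This is what is left for the
# fit when every column appears in the formula.
complete_all = KPI.dropna()
print("complete rows, all columns:", len(complete_all))

# Rows that are complete when only KPI6 and KPI1 are used.
complete_subset = KPI.dropna(subset=["KPI6", "KPI1"])
print("complete rows, KPI6 and KPI1 only:", len(complete_subset))

# Fraction of missing values per column, to see which columns cost the most rows.
missing_share = KPI.isna().mean()
print(missing_share)
```

This would also explain why the fit works with only a few predictors: fewer columns means more rows are complete. Note that dropna returns a new DataFrame and leaves KPI unchanged unless you assign the result back.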

If you have enough data overall, then you can try imputing and estimating with MICE (multiple imputation by chained equations), which iteratively imputes the missing values of each variable from the other variables.

answered Nov 15 '22 00:11

Josef

Tags: python, statsmodels, regression, valueerror
