I have a data set like:
Growth NHSPSTY% Index USURTOT Index GLPFTOCI Index CPTICHNG Index NAPMPMI Index RSTAXYOY Index SAARTOTL Index USASHVTK Index CONCCONF Index LEI TOTL Index SPX Index TOT_DEBT_TO_TOT_EQY BDIY Index cry index CO1 Comdty
Date
1998-03-31 4.1 7.5 4.7 0.121000 83.5325 52.9 2.9 -0.032258 0.404 133.80 88.9 0.455185 197.26 966 169.04 14.26
1998-06-30 3.8 9.8 4.5 0.125556 82.2970 48.9 4.5 0.154930 0.393 138.23 88.6 0.280973 204.65 856 152.58 13.38
I wanted to run an OLS regression, but every parameter came back as nan, along with these warnings:
/Users/jake/anaconda3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
return (self.a < x) & (x < self.b)
/Users/jake/anaconda3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
return (self.a < x) & (x < self.b)
/Users/jake/anaconda3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
cond2 = cond0 & (x <= self.a)
coef std err t P>|t| [0.025 0.975]
const nan nan nan nan nan nan
NHSPSTY% Index nan nan nan nan nan nan
USURTOT Index nan nan nan nan nan nan
GLPFTOCI Index nan nan nan nan nan nan
CPTICHNG Index nan nan nan nan nan nan
My command:
import statsmodels.api as sm
model = sm.OLS(data.Growth,sm.add_constant(data.iloc[:,1:])).fit()
model.summary()
Without further information, such as the data itself, it is not possible to give a definitive answer. The best we can do is make informed guesses, so here I list every reason I can think of for nan values in the output from statsmodels, along with some simple code to check for some of them:
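If the dependent variable (Growth) or any predictor contains missing values, the NaNs propagate through the estimation. OLS requires complete data (Allison, 2001). By default statsmodels uses missing='none' and does no NaN handling, so incomplete rows are not dropped for you: drop them yourself or pass missing='drop' to sm.OLS, then check that enough observations remain for estimation.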
Predictors that are linear combinations of others (e.g., X_1 = 2 × X_2) render the matrix X^T X singular, preventing unique coefficient estimates (Belsley et al., 1980). High correlation (e.g., > 0.99) between predictors can also destabilise estimates.
Columns with no variation (e.g., all zeros) are redundant when an intercept is included. This creates rank deficiency, which can lead to NaN coefficients.
If the number of rows in X and y differs due to misalignment or implicit dropping of NaNs, the regression will fail.
String or object-type columns in X may be coerced to NaN when converted to numbers, or may cause statsmodels to raise an error about object dtype instead of fitting.
When predictors (including the intercept) outnumber observations, the system is underdetermined, yielding no unique solution.
Ill-conditioned matrices (high condition number) can produce NaN due to floating-point errors.
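The rank-related causes above (linear combinations, constant columns, more parameters than observations) can be checked directly on the design matrix. The following is only a minimal sketch on made-up data: the arrays x1, x2, x3 are hypothetical stand-ins for your predictors, and the 0.99 threshold is the rule of thumb mentioned above, not a hard limit.

```python
import numpy as np

# Hypothetical data: x2 is an exact linear combination of x1
rng = np.random.default_rng(0)
n = 20
x1 = rng.normal(size=n)
x2 = 2.0 * x1
x3 = rng.normal(size=n)

# Design matrix with an explicit intercept column
X = np.column_stack([np.ones(n), x1, x2, x3])
n_obs, n_params = X.shape

# A rank below the number of columns means the columns are linearly dependent
rank = np.linalg.matrix_rank(X)
print("rank:", rank, "of", n_params, "columns")
print("rank deficient:", rank < n_params)
print("more parameters than observations:", n_params > n_obs)

# Pairwise correlations among the non-constant columns
corr = np.corrcoef(X[:, 1:], rowvar=False)
high = np.argwhere(np.triu(np.abs(corr) > 0.99, k=1))
print("highly correlated pairs (0-based, intercept excluded):", high.tolist())
```

With real data you would build X from your own predictors, for example from the columns of data after dropping Growth, and then remove or combine whichever columns the check flags.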
Allison, P. D. (2001). Missing data. Sage.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. Wiley.
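For the point about string or object-type columns, one way to see how many values would be lost is to coerce the non-numeric columns yourself and count the NaNs this introduces. This is a sketch on a hypothetical frame: the "n/a" entry and the values are invented for illustration, and only the column names are borrowed from the question.

```python
import pandas as pd

# Hypothetical frame: a stray text entry makes one column non-numeric
df = pd.DataFrame({
    "Growth": [4.1, 3.8, 3.5],
    "SPX Index": ["966", "n/a", "1020"],
})

# Columns that pandas does not treat as numeric
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print("non-numeric columns:", non_numeric)

# Coerce them to numbers; entries that cannot be parsed become NaN
coerced = df[non_numeric].apply(pd.to_numeric, errors="coerce")

# NaNs introduced by the coercion, per column
new_nans = coerced.isna().sum() - df[non_numeric].isna().sum()
print(new_nans)
```

Any column with a non-zero count contains entries that are not valid numbers and should be cleaned before fitting.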
The following is some basic diagnostic code:
```python
import numpy as np
import statsmodels.api as sm

# 1. Summary of missing values per column
print(data.isnull().sum())

# 2. Drop rows with any NaNs
data_clean = data.dropna()

# 3. Check for constant or all-zero columns among the predictors
print(data_clean.iloc[:, 1:].nunique() <= 1)

# 4. Check for object types (non-numeric columns)
print(data_clean.dtypes)

# 5. Confirm dimensions: observations should exceed predictors plus intercept
print("Observations (n):", data_clean.shape[0])
print("Predictors (p):", data_clean.shape[1] - 1)

# 6. Condition number (multicollinearity check)
X = sm.add_constant(data_clean.drop(columns='Growth'))
print("Condition number:", np.linalg.cond(X))

# 7. Final model (cleaned)
y = data_clean['Growth']
model = sm.OLS(y, X).fit()
print(model.summary())
```