I am running a regression as follows (df is a pandas dataframe):
import statsmodels.api as sm
est = sm.OLS(df['p'], df[['e', 'varA', 'meanM', 'varM', 'covAM']]).fit()
est.summary()
This gave me, among other statistics, an R-squared of 0.942. So then I wanted to plot the original y-values and the fitted values. For this, I sorted the original values:
import numpy as np
import matplotlib.pyplot as plt

orig = df['p'].values
fitted = est.fittedvalues.values
resid = est.resid.values
args = np.argsort(orig)

plt.plot(orig[args], 'bo')
plt.plot(orig[args] - resid[args], 'ro')   # equivalent to fitted[args]
plt.show()
This, however, gave me a graph where the values were completely off, nothing that would suggest an R-squared of 0.9. Therefore, I tried to calculate it manually:
yBar = df['p'].mean()
SSTot = df['p'].apply(lambda x: (x-yBar)**2).sum()
SSReg = ((est.fittedvalues - yBar)**2).sum()
1 - SSReg/SSTot
Out[79]: 0.2618159806908984
Am I doing something wrong? Or is there a reason why my computation is so far off from what statsmodels is getting? SSTot and SSReg have values of 48084 and 35495, respectively.
If you do not include an intercept (a constant explanatory variable) in your model, statsmodels computes R-squared based on the un-centred total sum of squares, i.e.
tss = (ys ** 2).sum() # un-centred total sum of squares
as opposed to
tss = ((ys - ys.mean())**2).sum() # centred total sum of squares
As a result, R-squared comes out much higher.
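Here is a minimal, self-contained sketch of the difference (the data below is made up purely for illustration; only the no-constant fit mirrors the question):

import numpy as np
import statsmodels.api as sm

# made-up data with a clearly non-zero mean, so the two TSS definitions differ a lot
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(200, 2))
y = X @ np.array([1.5, -0.7]) + 3.0 + rng.normal(scale=1.0, size=200)

res = sm.OLS(y, X).fit()                   # no constant in the design, as in the question

ssr = (res.resid ** 2).sum()               # sum of squared residuals
tss_uncentred = (y ** 2).sum()             # denominator used when there is no constant
tss_centred = ((y - y.mean()) ** 2).sum()  # denominator used when a constant is included

print(res.rsquared)                        # equals 1 - ssr / tss_uncentred
print(1 - ssr / tss_uncentred)
print(1 - ssr / tss_centred)               # typically much lower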
This is mathematically correct, because R-squared should indicate how much of the variation is explained by the full model compared to the reduced model. If you define your model as:
ys = beta1 . xs + beta0 + noise
then the reduced model can be ys = beta0 + noise, where the estimate for beta0 is the sample average ys.mean(), and thus noise = ys - ys.mean(). That is where the de-meaning comes from in a model with an intercept.
But from a model like:
ys = beta . xs + noise
you may only reduce to ys = noise. Since noise is assumed to be zero-mean, you may not de-mean ys. Therefore, the unexplained variation in the reduced model is the un-centred total sum of squares.
This is documented here under the rsquared item. Set yBar equal to zero (and compute SSReg/SSTot rather than 1 - SSReg/SSTot, since SSReg is the explained sum of squares), and I would expect you to get the same number.
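A quick way to check this against the fit from the question (reusing est and df; because the fitted values and residuals of a least-squares fit are orthogonal, both expressions below give the same number):

SSTot0 = (df['p'] ** 2).sum()             # un-centred total sum of squares (yBar = 0)
SSReg0 = (est.fittedvalues ** 2).sum()    # un-centred explained sum of squares
ssr = (est.resid ** 2).sum()              # residual sum of squares

print(SSReg0 / SSTot0)                    # should reproduce est.rsquared (about 0.942)
print(1 - ssr / SSTot0)                   # statsmodels' own definition without a constant
print(est.rsquared)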
If your model is:
a = <yourmodel>.fit()
Then, to compute fitted values:
a.fittedvalues
and to compute R-squared:
a.rsquared
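For example, with the fit from the question (est plays the role of a here):

est.fittedvalues   # fitted values, indexed like df
est.rsquared       # the 0.942 reported by summary()
est.resid          # residuals, also used implicitly in the plot above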