Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Appending predicted values and residuals to pandas dataframe

It's a useful and common practice to append predicted values and residuals from running a regression onto a dataframe as distinct columns. I'm new to pandas, and I'm having trouble performing this very simple operation. I know I'm missing something obvious. There was a very similar question asked about a year-and-a-half ago, but it wasn't really answered.

The dataframe currently looks something like this:

y               x1           x2   
880.37          3.17         23
716.20          4.76         26
974.79          4.17         73
322.80          8.70         72
1054.25         11.45        16

And all I'm wanting is to return a dataframe that has the predicted value and residual from y = x1 + x2 for each observation:

y               x1           x2       y_hat         res
880.37          3.17         23       840.27        40.10
716.20          4.76         26       752.60        -36.40
974.79          4.17         73       877.49        97.30
322.80          8.70         72       348.50        -25.70
1054.25         11.45        16       815.15        239.10

I've tried resolving this using statsmodels and pandas and haven't been able to solve it. Thanks in advance!

like image 436
Uncle Milton Avatar asked Aug 19 '15 16:08

Uncle Milton


3 Answers

Here is a variation on Alexander's answer using the OLS model from statsmodels instead of the pandas ols model. We can use either the formula or the array/DataFrame interface to the models.

fittedvalues and resid are pandas Series with the correct index. predict does not return a pandas Series.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]},
                   index=np.arange(10, 20, 2))

result = smf.ols('y ~ x1 + x2', df).fit()
df['yhat'] = result.fittedvalues
df['resid'] = result.resid


result2 = sm.OLS(df['y'], sm.add_constant(df[['x1', 'x2']])).fit()
df['yhat2'] = result2.fittedvalues
df['resid2'] = result2.resid

# predict doesn't return pandas series and no index is available
df['predicted'] = result.predict(df)

print(df)

       x1  x2        y        yhat       resid       yhat2      resid2  \
10   3.17  23   880.37  923.949309  -43.579309  923.949309  -43.579309   
12   4.76  26   716.20  890.732201 -174.532201  890.732201 -174.532201   
14   4.17  73   974.79  656.155079  318.634921  656.155079  318.634921   
16   8.70  72   322.80  610.510952 -287.710952  610.510952 -287.710952   
18  11.45  16  1054.25  867.062458  187.187542  867.062458  187.187542   

     predicted  
10  923.949309  
12  890.732201  
14  656.155079  
16  610.510952  
18  867.062458  

As preview, there is an extended prediction method in the model results in statsmodels master (0.7), but the API is not yet settled:

>>> print(result.get_prediction().summary_frame())
          mean     mean_se  mean_ci_lower  mean_ci_upper  obs_ci_lower  \
10  923.949309  268.931939    -233.171432    2081.070051   -991.466820   
12  890.732201  211.945165     -21.194241    1802.658643   -887.328646   
14  656.155079  269.136102    -501.844105    1814.154263  -1259.791854   
16  610.510952  282.182030    -603.620329    1824.642233  -1339.874985   
18  867.062458  329.017262    -548.584564    2282.709481  -1214.750941   

    obs_ci_upper  
10   2839.365439  
12   2668.793048  
14   2572.102012  
16   2560.896890  
18   2948.875858  
like image 84
Josef Avatar answered Nov 06 '22 14:11

Josef


This should be self explanatory.

import pandas as pd

df = pd.DataFrame({'x1': [3.17, 4.76, 4.17, 8.70, 11.45],
                   'x2': [23, 26, 73, 72, 16],
                   'y': [880.37, 716.20, 974.79, 322.80, 1054.25]})
model = pd.ols(y=df.y, x=df.loc[:, ['x1', 'x2']])
df['y_hat'] = model.y_fitted
df['res'] = model.resid

>>> df
      x1  x2        y       y_hat         res
0   3.17  23   880.37  923.949309  -43.579309
1   4.76  26   716.20  890.732201 -174.532201
2   4.17  73   974.79  656.155079  318.634921
3   8.70  72   322.80  610.510952 -287.710952
4  11.45  16  1054.25  867.062458  187.187542
like image 33
Alexander Avatar answered Nov 06 '22 15:11

Alexander


So, it's polite to form your questions such that it's easy for contributors to run your code.

import pandas as pd

y_col = [880.37, 716.20, 974.79, 322.80, 1054.25]
x1_col = [3.17, 4.76, 4.17, 8.70, 11.45]
x2_col = [23, 26, 73, 72, 16]

df = pd.DataFrame()
df['y'] = y_col
df['x1'] = x1_col
df['x2'] = x2_col

Then calling df.head() yields:

         y     x1  x2
0   880.37   3.17  23
1   716.20   4.76  26
2   974.79   4.17  73
3   322.80   8.70  72
4  1054.25  11.45  16

Now for your question, it's fairly straightforward to add columns with calculated values, though I'm not agreeing with your sample data:

df['y_hat'] = df['x1'] + df['x2']
df['res'] = df['y'] - df['y_hat']

For me, these yield:

         y     x1  x2  y_hat      res
0   880.37   3.17  23  26.17   854.20
1   716.20   4.76  26  30.76   685.44
2   974.79   4.17  73  77.17   897.62
3   322.80   8.70  72  80.70   242.10
4  1054.25  11.45  16  27.45  1026.80

Hope this helps!

like image 26
Andy Kubiak Avatar answered Nov 06 '22 14:11

Andy Kubiak