Data:
a,b,c,d
1,5,9,red
2,6,10,blue
3,7,11,green
4,8,12,red
3,4,3,orange
3,4,3,blue
3,4,3,red
In R, if I want to construct a linear regression model that takes categorical data into account (factor variables in R), I can simply do:
df$d = relevel(df$d, 'green')
After this, for the purpose of building the model, R will add columns for each colour, for example:
dblue
0
1
0
0
0
1
0
There will be no column for green: when all the other colour columns are 0, the row is green (this is our reference level); a Python sketch of this encoding follows the R output below. Now, create a regression model:
mod = lm(a ~ b + c + d, data=df)
summary(mod)
Call:
lm(formula = a ~ b + c + d, data = df)
Residuals:
1 2 3 4 5 6 7
4.708e-16 -7.061e-16 2.219e-31 2.354e-16 -1.233e-31 7.061e-16 -7.061e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.600e+00 3.622e-15 -4.418e+14 1.44e-15 ***
b 1.600e+00 9.403e-16 1.702e+15 3.74e-16 ***
c -6.000e-01 3.766e-16 -1.593e+15 4.00e-16 ***
dblue 8.829e-16 1.823e-15 4.840e-01 0.713
dorange 1.589e-15 2.294e-15 6.930e-01 0.614
dred 2.295e-15 1.631e-15 1.407e+00 0.393
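For comparison, the treatment coding described above can be reproduced in Python with patsy (the formula library that statsmodels uses underneath). This is only a sketch, assuming patsy is installed, and it previews the formula-based approach shown in the answer further down:
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 3, 3],
                   'b': [5, 6, 7, 8, 4, 4, 4],
                   'c': [9, 10, 11, 12, 3, 3, 3],
                   'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red']})

# One indicator column per non-reference colour; rows where all of them
# are 0 correspond to green, the reference level.
design = dmatrix('b + c + C(d, Treatment(reference="green"))', df,
                 return_type='dataframe')
print(design)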
I am trying to achieve the same thing with pandas in Python. So far I've only come up with this:
import pandas as pd

d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
     'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
                    dtype='category')}
df = pd.DataFrame(d)
df['d'] = pd.Categorical(df['d'], ordered=False)

# Manually add one indicator column per colour, skipping the reference level.
for r in df['d'].cat.categories:
    if r != 'green':
        df['d%s' % r] = df['d'] == r
df = df.drop('d', axis=1)
It works and yields the same results, but I'm wondering if there is a method in pandas for this.
You could use pd.get_dummies:
import pandas as pd
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
dtype='category')}
df = pd.DataFrame(d)
dummies = pd.get_dummies(df['d'])
df = pd.concat([df, dummies], axis=1)
df = df.drop(['d', 'green'], axis=1)
print(df)
yields
a b c blue orange red
0 1 5 9 0 0 1
1 2 6 10 1 0 0
2 3 7 11 0 0 0
3 4 8 12 0 0 1
4 3 4 3 0 1 0
5 3 4 3 1 0 0
6 3 4 3 0 0 1
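As a side note (a sketch, not part of the original answer): the closest pandas analogue of relevel() is to reorder the categories so that the reference level comes first and then let drop_first=True drop it, instead of dropping the green column by name:
import pandas as pd

colours = pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
                    dtype='category')
# Put 'green' first (the pandas counterpart of relevel), so drop_first removes it.
colours = colours.cat.reorder_categories(['green', 'blue', 'orange', 'red'])
dummies = pd.get_dummies(colours, prefix='d', drop_first=True)
print(dummies)   # columns d_blue, d_orange, d_red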
Using statsmodels,
import statsmodels.formula.api as smf
model = smf.ols('a ~ b + c + blue + orange + red', df).fit()
print(model.summary())
yields
OLS Regression Results
==============================================================================
Dep. Variable: a R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 2.149e+25
Date: Sun, 22 Mar 2015 Prob (F-statistic): 1.64e-13
Time: 05:57:33 Log-Likelihood: 200.74
No. Observations: 7 AIC: -389.5
Df Residuals: 1 BIC: -389.8
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -1.6000 6.11e-13 -2.62e+12 0.000 -1.600 -1.600
b 1.6000 1.59e-13 1.01e+13 0.000 1.600 1.600
c -0.6000 6.36e-14 -9.44e+12 0.000 -0.600 -0.600
blue 1.11e-16 3.08e-13 0.000 1.000 -3.91e-12 3.91e-12
orange 7.994e-15 3.87e-13 0.021 0.987 -4.91e-12 4.93e-12
red 4.829e-15 2.75e-13 0.018 0.989 -3.49e-12 3.5e-12
==============================================================================
Omnibus: nan Durbin-Watson: 0.203
Prob(Omnibus): nan Jarque-Bera (JB): 0.752
Skew: 0.200 Prob(JB): 0.687
Kurtosis: 1.445 Cond. No. 85.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Alternatively, you could use a patsy formula to specify the dummy contrast:
import pandas as pd
import statsmodels.formula.api as smf
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red']}
df = pd.DataFrame(d)
model = smf.ols('a ~ b + c + C(d, Treatment(reference="green"))', df).fit()
print(model.summary())
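The coefficient names in this summary are written relative to the green reference level; if you only want the estimates, the fitted result exposes them directly (a short usage note on the model object above):
# Coefficients named relative to the 'green' reference level.
print(model.params)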