Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R's relevel() and factor variables in linear regression in pandas

Data:

a,b,c,d
1,5,9,red
2,6,10,blue
3,7,11,green
4,8,12,red
3,4,3,orange
3,4,3,blue
3,4,3,red

In R, if I want to construct a linear regression model that takes into account categorical data (I think they're called factor variables in R), I can simply do:

df$d = relevel(df$d, 'green')

After this, for the purpose of building the model, R will add columns for each colour, for example:

dblue
0
1
0
0
0
1
0

There will be no column for green because if all other colour values are 0, it means that green=1 (this is our reference level). Now, create a regression model:

mod = lm(a ~ b + c + d, data=df)
summary(mod)

Call:
lm(formula = a ~ b + c + d, data = rel)

Residuals:
         1          2          3          4          5          6          7 
 4.708e-16 -7.061e-16  2.219e-31  2.354e-16 -1.233e-31  7.061e-16 -7.061e-16 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -1.600e+00  3.622e-15 -4.418e+14 1.44e-15 ***
b            1.600e+00  9.403e-16  1.702e+15 3.74e-16 ***
c           -6.000e-01  3.766e-16 -1.593e+15 4.00e-16 ***
dblue        8.829e-16  1.823e-15  4.840e-01    0.713    
dorange      1.589e-15  2.294e-15  6.930e-01    0.614    
dred         2.295e-15  1.631e-15  1.407e+00    0.393    

I am trying to achieve the same in Python Pandas. So far I've only come up with this:

d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3], 'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'], dtype='category')}
df = pd.DataFrame(d)
df['d'] = pd.Categorical(df['d'], ordered=False)
for r in df['d'].cat.categories:
    if r != 'green':
        df['d%s' % r] = df['d'] == r
df = df.drop('d', 1)

It works and yields the same results, but I'm wondering if there is a method in pandas for this.

like image 369
alkamid Avatar asked Mar 22 '15 09:03

alkamid


People also ask

How do you Relevel factors in R?

To do so we use the relevel() function of the R Language. The relevel() function is used to reorder levels of a factor vector. The levels of a factor vector are re-ordered so that the level specified by the user is first and the others are moved down one step.

How does R handle categorical variables?

In descriptive statistics for categorical variables in R, the value is limited and usually based on a particular finite group. For example, a categorical variable in R can be countries, year, gender, occupation. A continuous variable, however, can take any values, from integer to decimal.

What is a factor in linear regression?

Table of contents No headers. The simplest linear regression model finds the relationship between one input variable, which is called the predictor variable, and the output, which is called the system's response. This type of model is known as a one-factor linear regression.

Can pandas do linear regression?

Pandas, NumPy, and Scikit-Learn are three Python libraries used for linear regression.


1 Answers

You could use pd.get_dummies:

import pandas as pd
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3], 
     'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'], 
                    dtype='category')}
df = pd.DataFrame(d)
dummies = pd.get_dummies(df['d'])
df = pd.concat([df, dummies], axis=1)
df = df.drop(['d', 'green'], axis=1)
print(df)

yields

   a  b   c  blue  orange  red
0  1  5   9     0       0    1
1  2  6  10     1       0    0
2  3  7  11     0       0    0
3  4  8  12     0       0    1
4  3  4   3     0       1    0
5  3  4   3     1       0    0
6  3  4   3     0       0    1

Using statsmodels,

import statsmodels.formula.api as smf
model = smf.ols('a ~ b + c + blue + orange + red', df).fit()
print(model.summary())

yields

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      a   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 2.149e+25
Date:                Sun, 22 Mar 2015   Prob (F-statistic):           1.64e-13
Time:                        05:57:33   Log-Likelihood:                 200.74
No. Observations:                   7   AIC:                            -389.5
Df Residuals:                       1   BIC:                            -389.8
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -1.6000   6.11e-13  -2.62e+12      0.000        -1.600    -1.600
b              1.6000   1.59e-13   1.01e+13      0.000         1.600     1.600
c             -0.6000   6.36e-14  -9.44e+12      0.000        -0.600    -0.600
blue         1.11e-16   3.08e-13      0.000      1.000     -3.91e-12  3.91e-12
orange      7.994e-15   3.87e-13      0.021      0.987     -4.91e-12  4.93e-12
red         4.829e-15   2.75e-13      0.018      0.989     -3.49e-12   3.5e-12
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.203
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.752
Skew:                           0.200   Prob(JB):                        0.687
Kurtosis:                       1.445   Cond. No.                         85.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Alternatively, you could use a patsy formula to specify the dummy contrast:

import pandas as pd
import statsmodels.formula.api as smf

d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3], 
     'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red']}
df = pd.DataFrame(d)

model = smf.ols('a ~ b + c + C(d, Treatment(reference="green"))', df).fit()
print(model.summary())

References:

  • Coding categorical data
  • Patsy: Contrast Coding Systems for categorical variables
  • patsy.Treatment
like image 83
unutbu Avatar answered Nov 14 '22 21:11

unutbu