Data:
a,b,c,d
1,5,9,red
2,6,10,blue
3,7,11,green
4,8,12,red
3,4,3,orange
3,4,3,blue
3,4,3,red
In R, if I want to construct a linear regression model that takes categorical data into account (factor variables in R), I can simply do:
df$d = relevel(df$d, 'green')
After this, for the purpose of building the model, R will add columns for each colour, for example:
dblue
0
1
0
0
0
1
0
There will be no column for green: when all the other colour columns are 0, the row is green (this is our reference level); a Python sketch of this encoding follows the R output below. Now, create a regression model:
mod = lm(a ~ b + c + d, data=df)
summary(mod)
Call:
lm(formula = a ~ b + c + d, data = df)
Residuals:
1 2 3 4 5 6 7
4.708e-16 -7.061e-16 2.219e-31 2.354e-16 -1.233e-31 7.061e-16 -7.061e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.600e+00 3.622e-15 -4.418e+14 1.44e-15 ***
b 1.600e+00 9.403e-16 1.702e+15 3.74e-16 ***
c -6.000e-01 3.766e-16 -1.593e+15 4.00e-16 ***
dblue 8.829e-16 1.823e-15 4.840e-01 0.713
dorange 1.589e-15 2.294e-15 6.930e-01 0.614
dred 2.295e-15 1.631e-15 1.407e+00 0.393
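For comparison, the treatment coding described above can be reproduced in Python with patsy (the formula library that statsmodels uses underneath). This is only a sketch, assuming patsy is installed, and it previews the formula-based approach shown in the answer further down:
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({'a': [1, 2, 3, 4, 3, 3, 3],
                   'b': [5, 6, 7, 8, 4, 4, 4],
                   'c': [9, 10, 11, 12, 3, 3, 3],
                   'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red']})

# One indicator column per non-reference colour; rows where all of them
# are 0 correspond to green, the reference level.
design = dmatrix('b + c + C(d, Treatment(reference="green"))', df,
                 return_type='dataframe')
print(design)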
I am trying to achieve the same thing with pandas in Python. So far I've only come up with this:
import pandas as pd

d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
     'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
                    dtype='category')}
df = pd.DataFrame(d)
df['d'] = pd.Categorical(df['d'], ordered=False)

# Manually add one indicator column per colour, skipping the reference level.
for r in df['d'].cat.categories:
    if r != 'green':
        df['d%s' % r] = df['d'] == r
df = df.drop('d', axis=1)
It works and yields the same results, but I'm wondering if there is a method in pandas for this.
You could use pd.get_dummies:
import pandas as pd
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
'd': pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
dtype='category')}
df = pd.DataFrame(d)
dummies = pd.get_dummies(df['d'])
df = pd.concat([df, dummies], axis=1)
df = df.drop(['d', 'green'], axis=1)
print(df)
yields
a b c blue orange red
0 1 5 9 0 0 1
1 2 6 10 1 0 0
2 3 7 11 0 0 0
3 4 8 12 0 0 1
4 3 4 3 0 1 0
5 3 4 3 1 0 0
6 3 4 3 0 0 1
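As a side note (a sketch, not part of the original answer): the closest pandas analogue of relevel() is to reorder the categories so that the reference level comes first and then let drop_first=True drop it, instead of dropping the green column by name:
import pandas as pd

colours = pd.Series(['red', 'blue', 'green', 'red', 'orange', 'blue', 'red'],
                    dtype='category')
# Put 'green' first (the pandas counterpart of relevel), so drop_first removes it.
colours = colours.cat.reorder_categories(['green', 'blue', 'orange', 'red'])
dummies = pd.get_dummies(colours, prefix='d', drop_first=True)
print(dummies)   # columns d_blue, d_orange, d_red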
Using statsmodels,
import statsmodels.formula.api as smf
model = smf.ols('a ~ b + c + blue + orange + red', df).fit()
print(model.summary())
yields
OLS Regression Results
==============================================================================
Dep. Variable: a R-squared: 1.000
Model: OLS Adj. R-squared: 1.000
Method: Least Squares F-statistic: 2.149e+25
Date: Sun, 22 Mar 2015 Prob (F-statistic): 1.64e-13
Time: 05:57:33 Log-Likelihood: 200.74
No. Observations: 7 AIC: -389.5
Df Residuals: 1 BIC: -389.8
Df Model: 5
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -1.6000 6.11e-13 -2.62e+12 0.000 -1.600 -1.600
b 1.6000 1.59e-13 1.01e+13 0.000 1.600 1.600
c -0.6000 6.36e-14 -9.44e+12 0.000 -0.600 -0.600
blue 1.11e-16 3.08e-13 0.000 1.000 -3.91e-12 3.91e-12
orange 7.994e-15 3.87e-13 0.021 0.987 -4.91e-12 4.93e-12
red 4.829e-15 2.75e-13 0.018 0.989 -3.49e-12 3.5e-12
==============================================================================
Omnibus: nan Durbin-Watson: 0.203
Prob(Omnibus): nan Jarque-Bera (JB): 0.752
Skew: 0.200 Prob(JB): 0.687
Kurtosis: 1.445 Cond. No. 85.2
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Alternatively, you could use a patsy formula to specify the dummy contrast:
import pandas as pd
import statsmodels.formula.api as smf
d = {'a': [1,2,3,4,3,3,3], 'b': [5,6,7,8,4,4,4], 'c': [9,10,11,12,3,3,3],
'd': ['red', 'blue', 'green', 'red', 'orange', 'blue', 'red']}
df = pd.DataFrame(d)
model = smf.ols('a ~ b + c + C(d, Treatment(reference="green"))', df).fit()
print(model.summary())
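The coefficient names in this summary are written relative to the green reference level; if you only want the estimates, the fitted result exposes them directly (a short usage note on the model object above):
# Coefficients named relative to the 'green' reference level.
print(model.params)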