I have a set of data. I have used pandas to convert the variables into dummy and categorical variables respectively. Now I want to know how to run a multiple linear regression (I am using statsmodels) in Python. Are there any considerations, or do I have to indicate somewhere in my code that the variables are dummy/categorical? Or is the transformation of the variables enough, so that I can just run the regression as model = sm.OLS(y, X).fit()?
My code is the following:
import pandas as pd

datos = pd.read_csv("datos_2.csv")
df = pd.DataFrame(datos)  # note: read_csv already returns a DataFrame, so this wrap is redundant
print(df)
I get this:
Age  Gender  Wage    Job             Classification
32   Male    450000  Professor       High
28   Male    500000  Administrative  High
40   Female  20000   Professor       Low
47   Male    70000   Assistant       Medium
50   Female  345000  Professor       Medium
27   Female  156000  Assistant       Low
56   Male    432000  Administrative  Low
43   Female  100000  Administrative  Low
Then I encode Male = 1, Female = 0 and Professor = 1, Administrative = 2, Assistant = 3, this way:
df['Sex_male']=df.Gender.map({'Female':0,'Male':1})
df['Job_index']=df.Job.map({'Professor':1,'Administrative':2,'Assistant':3})
print(df)
Getting this:
Age  Gender  Wage    Job             Classification  Sex_male  Job_index
32   Male    450000  Professor       High            1         1
28   Male    500000  Administrative  High            1         2
40   Female  20000   Professor       Low             0         1
47   Male    70000   Assistant       Medium          1         3
50   Female  345000  Professor       Medium          0         1
27   Female  156000  Assistant       Low             0         3
56   Male    432000  Administrative  Low             1         2
43   Female  100000  Administrative  Low             0         2
Now, if I run a multiple linear regression, for example:
import statsmodels.api as sm

y = df['Wage']
X = df[['Sex_male', 'Job_index', 'Age']]  # the new columns were added to df, not datos
X = sm.add_constant(X)
model1 = sm.OLS(y, X).fit()
results1 = model1.summary(alpha=0.05)
print(results1)
The summary prints normally, but is it fine? Or do I have to indicate somehow that the variables are dummy or categorical? Please help, I am new to Python and I want to learn. Greetings from South America - Chile.
Categorical variables can absolutely be used in a linear regression model.
Once a categorical variable has been recoded as a dummy variable, the dummy variable can be used in regression analysis just like any other quantitative variable.
The recoding is necessary because categorical independent variables (i.e., nominal and ordinal independent variables) cannot be entered directly into a multiple regression; they first need to be converted into dummy variables.
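As a minimal sketch of what that conversion produces (pd.get_dummies here is just one way to do it; the .map approach in the question is another):

import pandas as pd

jobs = pd.Series(['Professor', 'Administrative', 'Assistant'])
print(pd.get_dummies(jobs, drop_first=True))
# A 3-level categorical becomes 2 indicator columns;
# the omitted level ('Administrative', alphabetically first) acts as the baseline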
You'll need to indicate that either Job or Job_index is a categorical variable; otherwise, in the case of Job_index it will be treated as a continuous variable (which just happens to take the values 1, 2, and 3), which isn't right.
You can use a few different kinds of notation in statsmodels. Here's the formula approach, which uses C() to indicate a categorical variable:
from statsmodels.formula.api import ols
fit = ols('Wage ~ C(Sex_male) + C(Job) + Age', data=df).fit()
fit.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                   Wage   R-squared:                       0.592
Model:                            OLS   Adj. R-squared:                  0.048
Method:                 Least Squares   F-statistic:                     1.089
Date:                Wed, 06 Jun 2018   Prob (F-statistic):              0.492
Time:                        22:35:43   Log-Likelihood:                -104.59
No. Observations:                   8   AIC:                             219.2
Df Residuals:                       3   BIC:                             219.6
Df Model:                           4
Covariance Type:            nonrobust
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept             3.67e+05   3.22e+05      1.141      0.337   -6.57e+05    1.39e+06
C(Sex_male)[T.1]     2.083e+05   1.39e+05      1.498      0.231   -2.34e+05    6.51e+05
C(Job)[T.Assistant] -2.167e+05   1.77e+05     -1.223      0.309    -7.8e+05    3.47e+05
C(Job)[T.Professor] -9273.0556   1.61e+05     -0.058      0.958   -5.21e+05    5.03e+05
Age                 -3823.7419   6850.345     -0.558      0.616   -2.56e+04     1.8e+04
==============================================================================
Omnibus:                        0.479   Durbin-Watson:                   1.620
Prob(Omnibus):                  0.787   Jarque-Bera (JB):                0.464
Skew:                          -0.108   Prob(JB):                        0.793
Kurtosis:                       1.839   Cond. No.                         215.
==============================================================================
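For completeness, here is a sketch of the same model without the formula interface, building the dummies up front and passing them to sm.OLS directly (fit2 and X2 are just illustrative names):

import pandas as pd
import statsmodels.api as sm

# dtype=float keeps the dummy columns numeric for OLS
X2 = pd.get_dummies(df[['Age', 'Sex_male', 'Job']], drop_first=True, dtype=float)
X2 = sm.add_constant(X2)
fit2 = sm.OLS(df['Wage'], X2).fit()
print(fit2.summary())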
Note: Job and Job_index won't use the same categorical level as the baseline, so you'll see slightly different results for the dummy coefficients at each level, even though the overall model fit remains the same.
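If you want to control which level is used as the baseline, you can set it explicitly with patsy's Treatment contrast inside C() (a sketch; using 'Professor' as the reference is just an example from this dataset):

from statsmodels.formula.api import ols

# Make 'Professor' the omitted (baseline) level for Job
fit = ols("Wage ~ C(Sex_male) + C(Job, Treatment(reference='Professor')) + Age",
          data=df).fit()
print(fit.summary())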
In linear regression with categorical variables you should be careful of the Dummy Variable Trap. The Dummy Variable Trap is a scenario in which the independent variables are multicollinear, that is, two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. This can make the model singular, meaning your regression just won't work.
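To see the trap concretely, here is a small sketch (not part of this answer's dataset) showing that keeping every dummy level alongside an intercept makes the design matrix singular:

import numpy as np
import pandas as pd

gender = pd.Series(['Male', 'Male', 'Female', 'Male', 'Female'])
dummies = pd.get_dummies(gender)  # keeps BOTH levels: 'Female' and 'Male'

# Stack an intercept column next to both dummy columns
X = np.column_stack([np.ones(len(gender)), dummies.to_numpy(dtype=float)])

# Rank is 2, not 3: Female + Male always equals the intercept column,
# so the matrix is singular and OLS has no unique solution
print(np.linalg.matrix_rank(X))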
The idea is to use dummy variable encoding with drop_first=True, which omits one column from each category after converting the categorical variables into dummy/indicator variables. You WILL NOT lose any relevant information by doing that, simply because every point in the dataset can be fully explained by the rest of the features.
Here is the complete code for how you can do it with your jobs dataset.
So you have your X features:
Age, Gender, Job, Classification
And one numerical feature that you are trying to predict:
Wage
First you need to split your initial dataset into input variables and the prediction target; assuming it is a pandas DataFrame, it would look like this:
Input variables (your dataset is a bit different, but the whole code remains the same; you will put every column from the dataset in X, except the one that goes to Y. pd.get_dummies works without problems like that: it just converts the categorical variables and doesn't touch the numerical ones):
X = jobs[['Age','Gender','Job','Classification']]
Prediction:
Y = jobs['Wage']
Convert categorical variable into dummy/indicator variables and drop one in each category:
X = pd.get_dummies(data=X, drop_first=True)
So now if you check the shape of X (X.shape) with drop_first=True, you will see that it has three fewer columns: one for each of your categorical variables (Gender, Job, and Classification).
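For example, a quick check with the 8-row jobs table above (a sketch; the exact counts depend on your data):

# assuming the jobs DataFrame from above
X_raw = jobs[['Age','Gender','Job','Classification']]
X_full = pd.get_dummies(data=X_raw)                   # keeps every dummy level
X_drop = pd.get_dummies(data=X_raw, drop_first=True)  # drops one level per categorical
print(X_full.shape, X_drop.shape)  # here: (8, 9) vs (8, 6)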
You can now continue to use them in your linear model. A scikit-learn implementation could look like this:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=40)
regr = linear_model.LinearRegression()  # keep the default fit_intercept=True, since drop_first=True already removed one dummy column per category
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
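To sanity-check the fit, you can score the held-out rows (a sketch; with only 8 rows the test split here is tiny, so treat the number with caution):

from sklearn.metrics import r2_score

print(r2_score(Y_test, predicted))  # R-squared on the test set
print(regr.score(X_test, Y_test))   # the same value via the estimator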