I have a pandas data frame and have fitted a simple (single-predictor) linear regression on it. Using the following code, I predicted the response values for some selected predictor values.
X = df['predictor1'].tolist()
y = df['response'].tolist()
X = sm.add_constant(X)
reg = sm.OLS(y, X).fit()
reg.predict(pd.DataFrame({'predictor1': 1, 'response': [12, 89, 90]}),'prediction')
Then, I have done multiple linear regression on the same data frame using the code below:
def multiple_lin_reg(df, predictors, response):
    y = dataset[response]
    X = dataset[predictors]
    X = sm.add_constant(X)
    reg = sm.OLS(y, X).fit()
    print(reg.summary())

multiple_lin_reg(df, ["predictor1","predictor2"], "response")
And everything works fine. Now I just want to predict the response for selected values of more than one predictor, say predictor1 = [12,22,33] and predictor2 = [90,21,23].
How can I write code similar to the single-predictor case to do this?
Note: I know that in R, it can be done by using the following commands but I want to do it in Python.
predictor1C = c(12, 22, 33)
predictor2C = c(90, 21, 23)
selected_predictor_values = expand.grid(predictor1 = predictor1C, predictor2 = predictor2C)
lm.fit = lm(response ~ predictor1 + predictor2, data = df)
predict(lm.fit, selected_predictor_values, interval = "prediction")
I think you have misunderstood something about how this works in Python. A model can only predict from the predictors it was fitted with. I'm not certain, but I think some models in R make this easier.
To be sure we are on the same page: you need one linear regression for each group of predictors. For example, if you want to predict with "age", you need one regression. If you want to predict with ("age", "height"), you need another one. That makes two models in total.
Let's look at your code:
[1] X = df['predictor1'].tolist()
[2] y = df['response'].tolist()
[3] X = sm.add_constant(X)
[4] reg = sm.OLS(y, X).fit()
[5] reg.predict(pd.DataFrame({'predictor1': 1, 'response': [12, 89, 90]}),'prediction')
To create a model with multiple predictors, build a list of the column names you want (by whatever method you like) and edit line [1] like this: X = df[your_list_of_strings]. Note that there is no .tolist() here: a DataFrame has no such method, and statsmodels accepts the DataFrame directly.
The other lines stay the same, except [5]. In [5], you can't make predictions unless the new data has the same predictors the model was trained with. The new data needs one column for each column of the fitted design matrix, in the same order (including the constant added in [3]), and it should not contain the response. With this in mind, you can make the predictions by replacing the line with something like: reg.predict(pd.DataFrame({'const': 1, 'predictor1': [12, 22, 33], 'predictor2': [90, 21, 23]}))
I don't recommend building the pd.DataFrame from a dictionary (the '{}' literal defines a dictionary in Python). For your application, NumPy makes this easier. You can pass as many rows as you want in a NumPy array (take a look at how array shapes work), then create a pandas data frame from that array, setting the columns argument to the right column names, and you are ready to predict. The linear regression then returns one predicted response per row of the table, in the same order.
Note: The second code block in your question doesn't work as written: the function uses a variable named dataset, but shouldn't that be df? Also, what do you want the function to return? If you want to create several trained models with different sets of predictors, you can add return reg to your function and loop over the sets of predictors, calling the function each time to get the trained regressor. A list can be used to keep the trained models.
You are almost there; in fact, the function multiple_lin_reg works if you replace dataset with df. Here's a reproducible snippet:
import numpy as np
import pandas as pd
import statsmodels.api as sm

num_predictors = 3
num_rows = 50

df = pd.DataFrame(
    np.random.rand(num_rows, num_predictors),
    columns=[f"predictor{i}" for i in range(1, num_predictors + 1)],
)

# create a response value
df["response"] = 0
for x, y in zip(range(1, num_predictors + 1), df.columns):
    df["response"] = df["response"] + x * df[y]


def multiple_lin_reg(df, predictors, response):
    y = df[response]
    X = df[predictors]
    X = sm.add_constant(X)
    reg = sm.OLS(y, X).fit()
    print(reg.summary())


multiple_lin_reg(
    df, [f"predictor{i}" for i in range(1, num_predictors + 1)], "response"
)
#                             OLS Regression Results
# ==============================================================================
# Dep. Variable:               response   R-squared:                       1.000
# Model:                            OLS   Adj. R-squared:                  1.000
# Method:                 Least Squares   F-statistic:                 2.707e+31
# Date:                Sat, 05 Feb 2022   Prob (F-statistic):               0.00
# Time:                        15:59:54   Log-Likelihood:                 1670.3
# No. Observations:                  50   AIC:                            -3333.
# Df Residuals:                      46   BIC:                            -3325.
# Df Model:                           3
# Covariance Type:            nonrobust
# ==============================================================================
#                   coef    std err          t      P>|t|     [0.025      0.975]
# ------------------------------------------------------------------------------
# const        7.216e-16   3.53e-16      2.043      0.047   1.06e-17    1.43e-15
# predictor1      1.0000   4.23e-16   2.37e+15      0.000      1.000       1.000
# predictor2      2.0000    3.7e-16   5.41e+15      0.000      2.000       2.000
# predictor3      3.0000   4.04e-16   7.43e+15      0.000      3.000       3.000
# ==============================================================================
# Omnibus:                        0.707   Durbin-Watson:                   1.297
# Prob(Omnibus):                  0.702   Jarque-Bera (JB):                0.741
# Skew:                           0.262   Prob(JB):                        0.690
# Kurtosis:                       2.716   Cond. No.                         6.44
# ==============================================================================
#
# Notes:
# [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
To generate predictions, you can pass similarly shaped data to the fitted model:
# note: reg is defined inside the multiple_lin_reg function
# so return this value, if prediction should be done outside
# the function
reg.predict(pd.DataFrame(sm.add_constant(np.random.rand(num_rows, num_predictors))))
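If you want something closer to the R snippet in the question, here is one possible sketch using the statsmodels formula interface. The training data is synthetic and the names grid and intervals are mine. itertools.product stands in for expand.grid (the rows come out in a different order than in R), and get_prediction(...).summary_frame() is, as far as I can tell, the nearest counterpart of predict(..., interval = "prediction").

```python
from itertools import product

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic training data (an assumption for illustration only).
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.random((50, 2)) * 100, columns=["predictor1", "predictor2"])
df["response"] = (
    3.0 + 0.5 * df["predictor1"] + 0.2 * df["predictor2"] + rng.normal(0, 1, 50)
)

# Counterpart of lm(response ~ predictor1 + predictor2, data = df).
# The formula interface adds the intercept on its own.
reg = smf.ols("response ~ predictor1 + predictor2", data=df).fit()

# Counterpart of expand.grid: every combination of the selected values.
grid = pd.DataFrame(
    list(product([12, 22, 33], [90, 21, 23])),
    columns=["predictor1", "predictor2"],
)

# "mean" is the point prediction; the obs_ci_* columns bound a new
# observation (prediction interval), the mean_ci_* columns bound the
# fitted mean (confidence interval).
intervals = reg.get_prediction(grid).summary_frame(alpha=0.05)
print(pd.concat([grid, intervals], axis=1))
```

If you would rather pair the values element-wise (12 with 90, 22 with 21, 33 with 23) instead of taking all nine combinations, build grid as pd.DataFrame({"predictor1": [12, 22, 33], "predictor2": [90, 21, 23]}).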