Pandas, NumPy, and Scikit-Learn are three Python libraries used for linear regression. Scitkit-learn's LinearRegression class is able to easily instantiate, be trained, and be applied in a few lines of code. Depending on how data is loaded, accessed, and passed around, there can be some issues that will cause errors.
Summarizing DataThe describe() function computes a summary of statistics pertaining to the DataFrame columns. This function gives the mean, std and IQR values. And, function excludes the character columns and given summary about numeric columns.
Pandas is an open-source Python library designed to deal with data analysis and data manipulation. Citing the official website, “pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.”
I think you can almost do exactly what you thought would be ideal, using the statsmodels package which was one of pandas
' optional dependencies before pandas
' version 0.20.0 (it was used for a few things in pandas.stats
.)
>>> import pandas as pd
>>> import statsmodels.formula.api as sm
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> result = sm.ols(formula="A ~ B + C", data=df).fit()
>>> print(result.params)
Intercept 14.952480
B 0.401182
C 0.000352
dtype: float64
>>> print(result.summary())
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.579
Model: OLS Adj. R-squared: 0.158
Method: Least Squares F-statistic: 1.375
Date: Thu, 14 Nov 2013 Prob (F-statistic): 0.421
Time: 20:04:30 Log-Likelihood: -18.178
No. Observations: 5 AIC: 42.36
Df Residuals: 2 BIC: 41.19
Df Model: 2
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 14.9525 17.764 0.842 0.489 -61.481 91.386
B 0.4012 0.650 0.617 0.600 -2.394 3.197
C 0.0004 0.001 0.650 0.583 -0.002 0.003
==============================================================================
Omnibus: nan Durbin-Watson: 1.061
Prob(Omnibus): nan Jarque-Bera (JB): 0.498
Skew: -0.123 Prob(JB): 0.780
Kurtosis: 1.474 Cond. No. 5.21e+04
==============================================================================
Warnings:
[1] The condition number is large, 5.21e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Note: pandas.stats
has been removed with 0.20.0
It's possible to do this with pandas.stats.ols
:
>>> from pandas.stats.api import ols
>>> df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40, 50], "C": [32, 234, 23, 23, 42523]})
>>> res = ols(y=df['A'], x=df[['B','C']])
>>> res
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <intercept>
Number of Observations: 5
Number of Degrees of Freedom: 3
R-squared: 0.5789
Adj R-squared: 0.1577
Rmse: 14.5108
F-stat (2, 2): 1.3746, p-value: 0.4211
Degrees of Freedom: model 2, resid 2
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
B 0.4012 0.6497 0.62 0.5999 -0.8723 1.6746
C 0.0004 0.0005 0.65 0.5826 -0.0007 0.0014
intercept 14.9525 17.7643 0.84 0.4886 -19.8655 49.7705
---------------------------------End of Summary---------------------------------
Note that you need to have statsmodels
package installed, it is used internally by the pandas.stats.ols
function.
I don't know if this is new in sklearn
or pandas
, but I'm able to pass the data frame directly to sklearn
without converting the data frame to a numpy array or any other data types.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(df[['B', 'C']], df['A'])
>>> reg.coef_
array([ 4.01182386e-01, 3.51587361e-04])
This would require me to reformat the data into lists inside lists, which seems to defeat the purpose of using pandas in the first place.
No it doesn't, just convert to a NumPy array:
>>> data = np.asarray(df)
This takes constant time because it just creates a view on your data. Then feed it to scikit-learn:
>>> from sklearn.linear_model import LinearRegression
>>> lr = LinearRegression()
>>> X, y = data[:, 1:], data[:, 0]
>>> lr.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
>>> lr.coef_
array([ 4.01182386e-01, 3.51587361e-04])
>>> lr.intercept_
14.952479503953672
Statsmodels kan build an OLS model with column references directly to a pandas dataframe.
Short and sweet:
model = sm.OLS(df[y], df[x]).fit()
Code details and regression summary:
# imports
import pandas as pd
import statsmodels.api as sm
import numpy as np
# data
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,size=(100, 3)), columns=list('ABC'))
# assign dependent and independent / explanatory variables
variables = list(df.columns)
y = 'A'
x = [var for var in variables if var not in y ]
# Ordinary least squares regression
model_Simple = sm.OLS(df[y], df[x]).fit()
# Add a constant term like so:
model = sm.OLS(df[y], sm.add_constant(df[x])).fit()
model.summary()
Output:
OLS Regression Results
==============================================================================
Dep. Variable: A R-squared: 0.019
Model: OLS Adj. R-squared: -0.001
Method: Least Squares F-statistic: 0.9409
Date: Thu, 14 Feb 2019 Prob (F-statistic): 0.394
Time: 08:35:04 Log-Likelihood: -484.49
No. Observations: 100 AIC: 975.0
Df Residuals: 97 BIC: 982.8
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 43.4801 8.809 4.936 0.000 25.996 60.964
B 0.1241 0.105 1.188 0.238 -0.083 0.332
C -0.0752 0.110 -0.681 0.497 -0.294 0.144
==============================================================================
Omnibus: 50.990 Durbin-Watson: 2.013
Prob(Omnibus): 0.000 Jarque-Bera (JB): 6.905
Skew: 0.032 Prob(JB): 0.0317
Kurtosis: 1.714 Cond. No. 231.
==============================================================================
How to directly get R-squared, Coefficients and p-value:
# commands:
model.params
model.pvalues
model.rsquared
# demo:
In[1]:
model.params
Out[1]:
const 43.480106
B 0.124130
C -0.075156
dtype: float64
In[2]:
model.pvalues
Out[2]:
const 0.000003
B 0.237924
C 0.497400
dtype: float64
Out[3]:
model.rsquared
Out[2]:
0.0190
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With