I'm trying to do multiple regression with time series data, but when I add the time series column to my model, it ends up treating each unique value as a separate variable, like so (my 'date' column is of type datetime):
est = smf.ols(formula='r ~ spend + date', data=df).fit()
print(est.summary())
                                                 coef   std err     t   P>|t|   [95.0% Conf. Int.]
Intercept                                  -6.249e-10       inf    -0     nan      nan      nan
date[T.Timestamp('2014-10-08 00:00:00')]   -2.571e-10       inf    -0     nan      nan      nan
date[T.Timestamp('2014-10-15 00:00:00')]    9.441e-11       inf     0     nan      nan      nan
date[T.Timestamp('2014-10-22 00:00:00')]    5.619e-11       inf     0     nan      nan      nan
date[T.Timestamp('2014-10-29 00:00:00')]   -8.035e-12       inf    -0     nan      nan      nan
date[T.Timestamp('2014-11-05 00:00:00')]    6.334e-11       inf     0     nan      nan      nan
date[T.Timestamp('2014-11-12 00:00:00')]      7.9e+04       inf     0     nan      nan      nan
date[T.Timestamp('2014-11-19 00:00:00')]     1.58e+05       inf     0     nan      nan      nan
date[T.Timestamp('2014-11-26 00:00:00')]     1.58e+05       inf     0     nan      nan      nan
date[T.Timestamp('2014-12-03 00:00:00')]     1.58e+05       inf     0     nan      nan      nan
date[T.Timestamp('2014-12-10 00:00:00')]     2.28e+05       inf     0     nan      nan      nan
date[T.Timestamp('2014-12-17 00:00:00')]     3.28e+05       inf     0     nan      nan      nan
date[T.Timestamp('2014-12-24 00:00:00')]    3.705e+05       inf     0     nan      nan      nan
spend                                       2.105e-10       inf     0     nan      nan      nan
I also tried the tsa module in statsmodels, but wasn't sure what to do about 'frequencies':
ar_model = sm.tsa.AR(df, freq='1')
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
I'd really like to see a data sample as well as a code snippet to reproduce your error. Without that, my suggestion will not address your particular error message. It will, however, let you run a multiple regression analysis on a set of time series stored in a pandas dataframe. Assuming that you're using continuous and not categorical values in your time series, here is how I would do it using pandas and statsmodels:
A dataframe with random values:
# Imports
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Reproducible random data: 12 daily observations of y, x1, x2, x3
np.random.seed(1)
rows = 12
listVars = ['y', 'x1', 'x2', 'x3']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df = pd.DataFrame(np.random.randint(100, 150, size=(rows, len(listVars))), columns=listVars)
df = df.set_index(rng)
print(df)
Output - some data to work with:
              y   x1   x2   x3
2017-01-01  137  143  112  108
2017-01-02  109  111  105  115
2017-01-03  100  116  101  112
2017-01-04  107  145  106  125
2017-01-05  120  137  118  120
2017-01-06  111  142  128  129
2017-01-07  114  104  123  123
2017-01-08  141  149  130  132
2017-01-09  122  113  141  109
2017-01-10  107  122  101  100
2017-01-11  117  108  124  113
2017-01-12  147  142  108  130
The function below lets you specify a source dataframe, a dependent variable y, and a list of independent variables such as x1 and x2. Using statsmodels, it stores a selection of results in a dataframe. There, R2 is a single value (formatted to four decimal places), while the regression coefficients and p-values are lists, since the number of estimates varies with the number of independent variables you include in the analysis.
def LinReg(df, y, x, const):
    # Keep the original list of regressor names for the results table
    betas = x.copy()

    # Model with or without a constant
    if const:
        x = sm.add_constant(df[x])
        model = sm.OLS(df[y], x).fit()
    else:
        model = sm.OLS(df[y], df[x]).fit()

    # Estimates of R2 and p
    res1 = {'Y': [y],
            'R2': [format(model.rsquared, '.4f')],
            'p': [model.pvalues.tolist()],
            'start': [df.index[0]],
            'stop': [df.index[-1]],
            'obs': [df.shape[0]],
            'X': [betas]}
    df_res1 = pd.DataFrame(data=res1)

    # Regression coefficients
    theParams = model.params[0:]
    coefs = theParams.to_frame()
    df_coefs = pd.DataFrame(coefs.T)
    xNames = list(df_coefs)
    xValues = list(df_coefs.loc[0].values)
    xValues2 = ['%.2f' % elem for elem in xValues]
    res2 = {'Independent': [xNames],
            'beta': [xValues2]}
    df_res2 = pd.DataFrame(data=res2)

    # All results, transposed into a single 'results' column
    df_res = pd.concat([df_res1, df_res2], axis=1)
    df_res = df_res.T
    df_res.columns = ['results']
    return df_res
Here's a test run:
df_regression = LinReg(df = df, y = 'y', x = ['x1', 'x2'], const = True)
print(df_regression)
Output:
                                                       results
R2                                                      0.3650
X                                                     [x1, x2]
Y                                                            y
obs                                                         12
p            [0.7417691742514285, 0.07989515781898897, 0.25...
start                                      2017-01-01 00:00:00
stop                                       2017-01-12 00:00:00
Independent                                    [const, x1, x2]
beta                                       [16.29, 0.47, 0.37]
Here's the whole thing for an easy copy-paste:
# Imports
import pandas as pd
import numpy as np
import statsmodels.api as sm
np.random.seed(1)
rows = 12
listVars= ['y','x1', 'x2', 'x3']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars)
df = df.set_index(rng)
def LinReg(df, y, x, const):
    # Keep the original list of regressor names for the results table
    betas = x.copy()

    # Model with or without a constant
    if const:
        x = sm.add_constant(df[x])
        model = sm.OLS(df[y], x).fit()
    else:
        model = sm.OLS(df[y], df[x]).fit()

    # Estimates of R2 and p
    res1 = {'Y': [y],
            'R2': [format(model.rsquared, '.4f')],
            'p': [model.pvalues.tolist()],
            'start': [df.index[0]],
            'stop': [df.index[-1]],
            'obs': [df.shape[0]],
            'X': [betas]}
    df_res1 = pd.DataFrame(data=res1)

    # Regression coefficients
    theParams = model.params[0:]
    coefs = theParams.to_frame()
    df_coefs = pd.DataFrame(coefs.T)
    xNames = list(df_coefs)
    xValues = list(df_coefs.loc[0].values)
    xValues2 = ['%.2f' % elem for elem in xValues]
    res2 = {'Independent': [xNames],
            'beta': [xValues2]}
    df_res2 = pd.DataFrame(data=res2)

    # All results, transposed into a single 'results' column
    df_res = pd.concat([df_res1, df_res2], axis=1)
    df_res = df_res.T
    df_res.columns = ['results']
    return df_res
df_regression = LinReg(df = df, y = 'y', x = ['x1', 'x2'], const = True)
print(df_regression)
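To connect this back to the formula in the question: the summary there suggests that the formula interface treats a datetime column as categorical, which is why every date gets its own coefficient. If what you want is a continuous time trend, one option is to convert the dates to a plain number first. The sketch below is only an illustration under that assumption. The dataframe df_q, its values, and the column name t are made up here, standing in for the question's data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical weekly data standing in for the question's dataframe
np.random.seed(0)
dates = pd.date_range('2014-10-01', periods=13, freq='7D')
df_q = pd.DataFrame({
    'date': dates,
    'spend': np.random.uniform(1000, 2000, size=len(dates)),
})
df_q['r'] = 0.5 * df_q['spend'] + np.random.normal(0, 50, size=len(dates))

# Days elapsed since the first observation: an ordinary integer column
df_q['t'] = (df_q['date'] - df_q['date'].min()).dt.days

# With a numeric regressor, the model estimates a single slope for time
est = smf.ols(formula='r ~ spend + t', data=df_q).fit()
print(est.params)
```

Whether a linear trend in days is the right way to model time depends on your data; this only shows how to avoid one dummy variable per date.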