I would like to run a regression with a time series as a predictor, and I'm trying to follow the answer given in this SO question (OLS with pandas: datetime index as predictor), but as far as I can tell it no longer works.
Am I missing something or is there a new way to do this?
import pandas as pd
rng = pd.date_range('1/1/2011', periods=4, freq='H')
s = pd.Series(range(4), index = rng)
z = s.reset_index()
pd.ols(x=z["index"], y=z[0])
I'm getting the error below. The error is self-explanatory, but I'm wondering what I'm missing in reimplementing a solution that worked before.
TypeError: cannot astype a datetimelike from [datetime64[ns]] to [float64]
Linear regression is commonly used in time series analysis to model a trend. For example, given a time series of sales, a linear regression on time can be used to predict future sales.
pandas has extensive capabilities for working with time series data. Building on the NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large number of features from other Python libraries such as scikits.timeseries.
pandas, NumPy, and scikit-learn are three Python libraries that are commonly used together for linear regression.
I'm not sure why pd.ols is so picky there (it does appear to me that you followed the example correctly). I suspect this is due to changes in how pandas handles or stores datetime indexes but am too lazy to explore this further. Anyway, since your datetime variable differs only in the hour, you could just extract the hour with the dt accessor:
pd.ols(x=pd.to_datetime(z["index"]).dt.hour, y=z[0])
However, that gives you an r-squared of 1, since y is an exact linear function of x and a model with an intercept fits the data perfectly. You could change the range to np.random.randn and then you'd get something that looks like normal regression results.
In [6]: z = pd.Series(np.random.randn(4), index = rng).reset_index()
pd.ols(x=pd.to_datetime(z["index"]).dt.hour, y=z[0])
Out[6]:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <x> + <intercept>
Number of Observations: 4
Number of Degrees of Freedom: 2
R-squared: 0.7743
Adj R-squared: 0.6615
Rmse: 0.5156
F-stat (1, 2): 6.8626, p-value: 0.1200
Degrees of Freedom: model 1, resid 2
-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    -0.6040     0.2306      -2.62     0.1200    -1.0560    -0.1521
     intercept     0.2915     0.4314       0.68     0.5689    -0.5540     1.1370
---------------------------------End of Summary---------------------------------
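For readers on a current pandas: pd.ols was removed from pandas in version 0.20, so the call above will not run there. Here is a minimal sketch of the same hour-based regression using np.polyfit instead. The fit_line helper, the column names, and the fixed random seed are my own choices for illustration, not part of the original answer, and the numbers will not match the summary above.

```python
import numpy as np
import pandas as pd

# Four hourly timestamps, as in the question (built without a freq alias string).
rng = pd.date_range("2011-01-01", periods=4, freq=pd.Timedelta(hours=1))
noise = np.random.default_rng(0).standard_normal(4)  # seed is arbitrary
z = pd.Series(noise, index=rng).reset_index()
z.columns = ["timestamp", "y"]  # illustrative names


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept; returns slope, intercept, R-squared."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return slope, intercept, 1.0 - ss_res / ss_tot


# Use the hour of each timestamp as the numeric predictor.
hours = z["timestamp"].dt.hour
slope, intercept, r_squared = fit_line(hours, z["y"])
print(slope, intercept, r_squared)
```

If you need the full summary table (standard errors, t-stats, confidence intervals), a dedicated statistics package would be the better fit; this sketch only recovers the coefficients and r-squared.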
Alternatively, you could convert the index to an integer, although I found this didn't work very well (I'm assuming because the integers represent nanoseconds since the epoch or something like that, and hence are very large and cause precision issues), but converting to integer and dividing by a trillion or so did work and gave essentially the same results as using dt.hour (i.e. same r-squared):
pd.ols(x=pd.to_datetime(z["index"]).astype(int)/1e12, y=z[0])
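A variation on the rescaling idea, sketched under my own assumptions: rather than dividing epoch nanoseconds by a constant, subtract the first timestamp and use elapsed seconds, which keeps the predictor small without picking a divisor. The r_squared helper and the random seed are illustrative, and np.polyfit stands in for pd.ols. Because elapsed seconds are just the hours multiplied by 3600 here, the r-squared should be the same either way.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2011-01-01", periods=4, freq=pd.Timedelta(hours=1))
timestamps = pd.Series(rng)
y = np.random.default_rng(0).standard_normal(4)  # seed is arbitrary

# Elapsed seconds since the first observation: 0, 3600, 7200, 10800.
elapsed_seconds = (timestamps - timestamps.iloc[0]).dt.total_seconds().to_numpy()
hours = timestamps.dt.hour.to_numpy(dtype=float)


def r_squared(x, y):
    """R-squared of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = slope * x + intercept
    return 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)


r2_seconds = r_squared(elapsed_seconds, y)
r2_hours = r_squared(hours, y)
print(r2_seconds, r2_hours)
```

Only the slope changes with the unit of the predictor (per second versus per hour); the quality of the fit does not.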
Source of the error message
FWIW, it looks like that error message is coming from something like this:
pd.to_datetime(z["index"]).astype(float)
Although a fairly obvious workaround is this:
pd.to_datetime(z["index"]).astype(int).astype(float)
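To see both behaviours side by side, here is a small sketch, assuming a reasonably recent pandas. It writes the integer step as "int64" rather than int so the width is explicit; the variable names are mine. The exact wording of the error message differs between pandas versions, so only the exception type is checked.

```python
import pandas as pd

timestamps = pd.Series(
    pd.date_range("2011-01-01", periods=4, freq=pd.Timedelta(hours=1))
)

# Casting datetimes straight to float is rejected with a TypeError.
try:
    timestamps.astype(float)
    direct_cast_failed = False
except TypeError as exc:
    direct_cast_failed = True
    print("direct cast failed:", exc)

# Going through a 64-bit integer first works; the values are the raw
# integer representation of each timestamp.
as_float = timestamps.astype("int64").astype(float)
print(as_float.dtype)
```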