I am new to python and programming in general, so forgive any simple mistakes/ things that should be obvious.
What I am trying to do is quite simple, I just want to fit a linear trend (1-d polynomial) to a bunch of time-series to see whether the slopes are positive or negative. Right now I am just trying to get it to work for one time series.
The problem: it seems like both pandas and numpy can't do regressions for datetimes. My date times are not regular (generally 1 day per month but not the same day) so can't use the suggestion posed in Linear Regression from Time Series Pandas
My time series csv looks like:
StationName, year, month, day, depth, NO3-N, PO4-P, TotP, TotN,
Kvarnbacken (Savaran), 2003, 2, 25, 0.5, 46, 9, 14, 451
Kvarnbacken (Savaran), 2003, 3, 18, 0.5, 64, 15, 17, 310
Kvarnbacken (Savaran), 2003, 3, 31, 0.5, 76, 7, 19, 566
so far what i have is
import datetime as dt
from scipy import stats
import numpy as np
# read in station csv file
data = pd.read_csv('Kvarnbacken (Savaran)_2003.csv')
data.head()
# set up dates to something python can recognize
data['date'] = pd.to_datetime(data.year*10000+data.month *
100+data.day, format='%Y%m%d')
I tried
slope, intercept, r_value, p_value, std_err = stats.linregress(data.date,
data.TotP)
and got the error TypeError: ufunc add cannot use operands with types dtype('
I also tried
coefP = np.polyfit(data.date, data.TotP, 1)
polyP = np.poly1d(coefP)
ys = polyP(data.date)
print 'For P: coef, poly'
print coefP
print polyP
and got the same error.
I am guessing the easiest way around this is to do something where I just count the days since the first measurement I have and then just do a regression with days_since to the total phosphorous concentration (totP) but I am not sure of the easiest way to do that or if there was another trick.
pandas contains extensive capabilities and features for working with time series data for all domains. Using the NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large number of features from other Python libraries like scikits.
Of course you can use linear regression for time series data. It's just that there are specific tools that only work for time series data that sometimes do a better job.
Pandas, NumPy, and Scikit-Learn are three Python libraries used for linear regression.
The package scikit-learn is a widely used Python library for machine learning, built on top of NumPy and some other packages. It provides the means for preprocessing data, reducing dimensionality, implementing regression, classifying, clustering, and more.
You could convert the datetime to days in the following way.
data['days_since'] = (data.date - pd.to_datetime('2003-02-25') ).astype('timedelta64[D]')
date days_since
0 2003-02-25 0
1 2003-03-18 21
2 2003-03-31 34
Now you should be able to regress as you did above.
slope, intercept, r_value, p_value, std_err = stats.linregress(data.days_since,
data.TotP)
slope, intercept
(0.1466591166477916, 13.977916194790488)
You might also want to consider other regression options such as the statsmodels package, especially if you'll be doing this sort of thing very often. (Note that x and y are reversed compared to linregress)
import statsmodels.formula.api as smf
smf.ols( 'TotP ~ days_since', data=data ).fit().params
Intercept 13.977916
days_since 0.146659
That's just a fraction of the statsmodels output btw (use summary()
instead of params
to get the extra output.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With