Is there a way to approximate the periodicity of a time series in pandas? In R, xts
objects have a method called periodicity
that serves exactly this purpose. Is there an equivalent implemented in pandas?
For instance, can we infer the frequency of a time series that does not specify one?
import pandas.io.data as web  # removed in later pandas; pandas_datareader replaces it
aapl = web.get_data_yahoo("AAPL")
aapl.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2010-01-04 00:00:00, ..., 2013-12-19 00:00:00]
Length: 999, Freq: None, Timezone: None
The frequency of this series can reasonably be approximated as daily.
Update:
I think it might be helpful to show the source code of R's implementation of the periodicity method.
function (x, ...)
{
    if (timeBased(x) || !is.xts(x))
        x <- try.xts(x, error = "'x' needs to be timeBased or xtsible")
    p <- median(diff(.index(x)))
    if (is.na(p))
        stop("can not calculate periodicity of 1 observation")
    units <- "days"
    scale <- "yearly"
    label <- "year"
    if (p < 60) {
        units <- "secs"
        scale <- "seconds"
        label <- "second"
    }
    else if (p < 3600) {
        units <- "mins"
        scale <- "minute"
        label <- "minute"
        p <- p/60L
    }
    else if (p < 86400) {
        units <- "hours"
        scale <- "hourly"
        label <- "hour"
    }
    else if (p == 86400) {
        scale <- "daily"
        label <- "day"
    }
    else if (p <= 604800) {
        scale <- "weekly"
        label <- "week"
    }
    else if (p <= 2678400) {
        scale <- "monthly"
        label <- "month"
    }
    else if (p <= 7948800) {
        scale <- "quarterly"
        label <- "quarter"
    }
    structure(list(difftime = structure(p, units = units, class = "difftime"),
        frequency = p, start = start(x), end = end(x), units = units,
        scale = scale, label = label), class = "periodicity")
}
I think this line is the key, though I don't quite understand it:
p <- median(diff(.index(x)))
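In Python terms, `.index(x)` returns the numeric timestamps of the series and `diff` the gaps between them, so that line computes the median spacing between consecutive observations. A minimal NumPy sketch of the same idea, on a made-up business-day index standing in for the AAPL data:

```python
import numpy as np
import pandas as pd

# A DatetimeIndex of business days, analogous to the AAPL index above
idx = pd.bdate_range("2010-01-04", periods=10)

# R's `median(diff(.index(x)))`: median gap between consecutive timestamps
p = np.median(np.diff(idx.values))

# Express the median gap in days
days = p / np.timedelta64(1, "D")
print(days)  # 1.0 -- most consecutive business days are one day apart
```

The median is robust to the occasional weekend or holiday gap, which is why it works better here than the mean.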
This time series skips weekends (and holidays), so it really doesn't have a daily frequency to begin with. You could use asfreq
to upsample it to a time series with daily frequency, however:
aapl = aapl.asfreq('D', method='ffill')
Doing so propagates forward the last observed value to dates with missing values.
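To see the effect on a small made-up example (the dates and values are hypothetical), forward-filling across a skipped weekend looks like this:

```python
import pandas as pd

# A toy series that skips a weekend, like daily stock data
s = pd.Series([1.0, 2.0, 3.0],
              index=pd.to_datetime(["2013-12-06",    # Friday
                                    "2013-12-09",    # Monday
                                    "2013-12-10"]))  # Tuesday

# Upsample to a true daily frequency, carrying Friday's value
# forward through the weekend
daily = s.asfreq("D", method="ffill")
print(daily)
```

The result has five rows (Friday through Tuesday), with Saturday and Sunday holding Friday's value.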
Note that Pandas also has a business day frequency, so it is also possible to upsample to business days by using:
aapl = aapl.asfreq('B', method='ffill')
If you wish to automate the process of inferring the median frequency in days, then you could do this:
import pandas as pd
import numpy as np
import pandas.io.data as web  # removed in later pandas; use pandas_datareader instead

aapl = web.get_data_yahoo("AAPL")
# median gap between consecutive observations, as a timedelta64
f = np.median(np.diff(aapl.index.values))
# truncate the median gap to a whole number of days
days = f.astype('timedelta64[D]').item().days
aapl = aapl.asfreq('{}D'.format(days), method='ffill')
print(aapl)
This code needs testing, but perhaps it comes close to the R code you posted:
import pandas as pd
import numpy as np
import pandas.io.data as web  # removed in later pandas; use pandas_datareader instead

def infer_freq(ts):
    # median gap between consecutive observations, in whole seconds
    med = np.median(np.diff(ts.index.values))
    seconds = int(med.astype('timedelta64[s]').item().total_seconds())
    # thresholds mirror the R code: minute, hour, day, week, month, quarter
    if seconds < 60:
        freq = '{}s'.format(seconds)
    elif seconds < 3600:
        freq = '{}T'.format(seconds//60)       # minutes
    elif seconds < 86400:
        freq = '{}H'.format(seconds//3600)     # hours
    elif seconds < 604800:
        freq = '{}D'.format(seconds//86400)    # days
    elif seconds < 2678400:
        freq = '{}W'.format(seconds//604800)   # weeks
    elif seconds < 7948800:
        freq = '{}M'.format(seconds//2678400)  # months
    else:
        freq = '{}Q'.format(seconds//7948800)  # quarters
    return ts.asfreq(freq, method='ffill')

aapl = web.get_data_yahoo("AAPL")
print(infer_freq(aapl))
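As an aside, later pandas versions ship `pd.infer_freq`, which attempts this kind of inference directly from a DatetimeIndex when the spacing is regular; a small sketch, assuming a recent pandas:

```python
import pandas as pd

# pd.infer_freq guesses a frequency string from a regularly spaced index
daily = pd.date_range("2010-01-04", periods=10, freq="D")
print(pd.infer_freq(daily))   # 'D'

# It also recognizes the business-day pattern (weekends skipped)
bdays = pd.bdate_range("2010-01-04", periods=10)
print(pd.infer_freq(bdays))   # 'B'
```

Note that on real market data with holidays removed the index is not perfectly regular, so `pd.infer_freq` may return `None`; the median-based approach above is more forgiving.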
I don't know about frequency; the only meaningful measure I can come up with is the mean timedelta, for example in days:
>>> import numpy as np
>>> idx = aapl.index.values
>>> (np.roll(idx, -1) - idx)[:-1].mean()/np.timedelta64(1, 'D')
1.4478957915831596
or in hours:
>>> (np.roll(idx, -1) - idx)[:-1].mean()/np.timedelta64(1, 'h')
34.749498997995836
The same with a more pandorable expression, kudos to @DSM:
>>> aapl.index.to_series().diff().mean() / (60*60*10**9)
34.749498997995993
Of course, the median will be 24 hours, since most consecutive days are present in the index:
>>> aapl.index.to_series().diff().median() / (60*60*10**9)
24.0
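In recent pandas versions, `diff().mean()` returns a `Timedelta` rather than a raw nanosecond count, so dividing by a `pd.Timedelta` expresses the same computation without the `60*60*10**9` magic number; a sketch on a made-up business-day index:

```python
import pandas as pd

# Business-day index with one weekend gap, standing in for the AAPL index
idx = pd.bdate_range("2010-01-04", periods=6)

deltas = idx.to_series().diff()

# Dividing a Timedelta by another Timedelta yields a plain float
mean_hours = deltas.mean() / pd.Timedelta(hours=1)      # 33.6 (gaps 1,1,1,1,3 days)
median_hours = deltas.median() / pd.Timedelta(hours=1)  # 24.0

print(mean_hours, median_hours)
```

`deltas.mean()` skips the leading NaT that `diff()` produces, so the averages are taken over the five actual gaps.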