
How can I approximate the periodicity of a pandas time Series

Tags: python, pandas

Is there a way to approximate the periodicity of a time series in pandas? In R, xts objects have a method called periodicity that serves exactly this purpose. Does pandas implement anything similar?

For instance, can we infer the frequency of a time series whose index does not specify one?

import pandas_datareader.data as web  # pandas.io.data now lives in the pandas-datareader package
aapl = web.get_data_yahoo("AAPL")
print(aapl.index)

<class 'pandas.tseries.index.DatetimeIndex'>
[2010-01-04 00:00:00, ..., 2013-12-19 00:00:00]
Length: 999, Freq: None, Timezone: None

The frequency of this series can reasonably be approximated as daily.

Update:

I think it might be helpful to show the source code of R's implementation of the periodicity method.

function (x, ...) 
{
    if (timeBased(x) || !is.xts(x)) 
        x <- try.xts(x, error = "'x' needs to be timeBased or xtsible")
    p <- median(diff(.index(x)))
    if (is.na(p)) 
        stop("can not calculate periodicity of 1 observation")
    units <- "days"
    scale <- "yearly"
    label <- "year"
    if (p < 60) {
        units <- "secs"
        scale <- "seconds"
        label <- "second"
    }
    else if (p < 3600) {
        units <- "mins"
        scale <- "minute"
        label <- "minute"
        p <- p/60L
    }
    else if (p < 86400) {
        units <- "hours"
        scale <- "hourly"
        label <- "hour"
    }
    else if (p == 86400) {
        scale <- "daily"
        label <- "day"
    }
    else if (p <= 604800) {
        scale <- "weekly"
        label <- "week"
    }
    else if (p <= 2678400) {
        scale <- "monthly"
        label <- "month"
    }
    else if (p <= 7948800) {
        scale <- "quarterly"
        label <- "quarter"
    }
    structure(list(difftime = structure(p, units = units, class = "difftime"), 
        frequency = p, start = start(x), end = end(x), units = units, 
        scale = scale, label = label), class = "periodicity")
}

I think this line is the key, but I don't quite understand it: p <- median(diff(.index(x)))

Asked by zsljulius, Dec 20 '13 20:12


2 Answers

This time series skips weekends (and holidays), so it doesn't really have a daily frequency to begin with. However, you could use asfreq to upsample it to a time series with daily frequency:

aapl = aapl.asfreq('D', method='ffill')

Doing so propagates the last observed value forward to the dates that were missing.

Note that pandas also has a business day frequency, so it is also possible to upsample to business days instead:

aapl = aapl.asfreq('B', method='ffill')
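To see what the two asfreq calls do without downloading anything, here is a minimal sketch on a made-up five-row series (the dates and values are purely illustrative, not AAPL data). It also calls pd.infer_freq, which can recognise a business-day pattern when the index has no gaps other than weekends; on real price data with holidays missing it may well return None.

```python
import pandas as pd

# Hypothetical toy data: Thu, Fri, then Mon-Wed, so one weekend is skipped.
idx = pd.to_datetime(
    ["2013-12-12", "2013-12-13", "2013-12-16", "2013-12-17", "2013-12-18"]
)
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# Calendar-day frequency: Saturday and Sunday are added and forward-filled.
daily = prices.asfreq("D", method="ffill")

# Business-day frequency: no rows are added here because only the weekend was missing.
business = prices.asfreq("B", method="ffill")

print(daily)
print(business)

# The index itself carries no freq, but the pattern can still be inferred.
print(prices.index.freq)            # None
print(pd.infer_freq(prices.index))  # 'B' for this gap-free toy index
```

With 'D' the weekend rows repeat Friday's value; with 'B' the series is unchanged apart from the index now carrying a frequency.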

If you wish to automate the process of inferring the median frequency in days, then you could do this:

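import pandas as pd
import numpy as np
import pandas_datareader.data as web  # formerly pandas.io.data
aapl = web.get_data_yahoo("AAPL")
# Median gap between consecutive timestamps (a numpy timedelta64)
f = np.median(np.diff(aapl.index.values))
# Truncate to whole days and read the count off the resulting datetime.timedelta
days = f.astype('timedelta64[D]').item().days
# Resample at that spacing, carrying the last observation forward
aapl = aapl.asfreq('{}D'.format(days), method='ffill')
print(aapl)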

This code needs testing, but perhaps it comes close to the R code you posted:

import pandas as pd
import numpy as np
import pandas_datareader.data as web  # formerly pandas.io.data

def infer_freq(ts):
    # Median gap between consecutive timestamps, in whole seconds
    med = np.median(np.diff(ts.index.values))
    seconds = int(med.astype('timedelta64[s]').item().total_seconds())
    # Pick a frequency string using the same thresholds as the R code.
    # Note: pandas 2.2 and later prefer the aliases 'min', 'h', 'ME' and 'QE'
    # over 'T', 'H', 'M' and 'Q'.
    if seconds < 60:
        freq = '{}s'.format(seconds)
    elif seconds < 3600:
        freq = '{}T'.format(seconds//60)
    elif seconds < 86400:
        freq = '{}H'.format(seconds//3600)
    elif seconds < 604800:
        freq = '{}D'.format(seconds//86400)
    elif seconds < 2678400:
        freq = '{}W'.format(seconds//604800)
    elif seconds < 7948800:
        freq = '{}M'.format(seconds//2678400)
    else:
        freq = '{}Q'.format(seconds//7948800)
    # Returns the resampled series, not the frequency string itself
    return ts.asfreq(freq, method='ffill')

aapl = web.get_data_yahoo("AAPL")
print(infer_freq(aapl))
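If you only want the label that R's periodicity reports, rather than a resampled series, something along these lines might do. It is a rough sketch, not a tested port: periodicity_label is a name I made up, the thresholds are copied from the R source in the question, and the input is assumed to be a DatetimeIndex.

```python
import pandas as pd

def periodicity_label(index):
    """Classify the median gap of a DatetimeIndex using the R thresholds.

    Hypothetical helper, not part of pandas.
    """
    if len(index) < 2:
        raise ValueError("cannot calculate periodicity of 1 observation")
    # Median spacing between consecutive timestamps, in seconds
    p = index.to_series().diff().median().total_seconds()
    if p < 60:
        return "seconds"
    if p < 3600:
        return "minute"
    if p < 86400:
        return "hourly"
    if p == 86400:
        return "daily"
    if p <= 604800:
        return "weekly"
    if p <= 2678400:
        return "monthly"
    if p <= 7948800:
        return "quarterly"
    return "yearly"

# Made-up indexes to exercise a few branches
business_days = pd.bdate_range("2013-01-01", "2013-12-19")
hourly = pd.date_range("2013-01-01", periods=10, freq=pd.Timedelta(hours=1))
month_starts = pd.to_datetime(["2013-01-01", "2013-02-01", "2013-03-01", "2013-04-01"])

print(periodicity_label(business_days))  # daily
print(periodicity_label(hourly))         # hourly
print(periodicity_label(month_starts))   # monthly
```

The business-day index comes out as "daily" because four of every five gaps are exactly one day, so the median ignores the three-day weekend gaps.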
Answered by unutbu, Oct 31 '22 14:10


I don't know about frequency; the only meaningful measure I can come up with is the mean timedelta, for example in days:

>>> import numpy as np
>>> idx = aapl.index.values
>>> (np.roll(idx, -1) - idx)[:-1].mean()/np.timedelta64(1, 'D')
1.4478957915831596

or in hours:

>>> (np.roll(idx, -1) - idx)[:-1].mean()/np.timedelta64(1, 'h')
34.749498997995836

The same with a more pandorable expression, kudos to @DSM (the divisor converts nanoseconds to hours):

>>> aapl.index.to_series().diff().mean() / (60*60*10**9)
34.749498997995993

The median, of course, is 24 hours, since most consecutive days are present in the index:

>>> aapl.index.to_series().diff().median() / (60*60*10**9)
24.0
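In recent pandas versions, dividing a Timedelta by a plain integer gives another Timedelta rather than a number, so dividing by pd.Timedelta(hours=1) is probably the safer way to get hours as a float. A small sketch, using a generated business-day index as a stand-in for the AAPL dates (an assumption, so the mean will not match the figures above):

```python
import pandas as pd

# Hypothetical stand-in for aapl.index: every business day in a date range
idx = pd.bdate_range("2013-01-01", "2013-12-19")

# Gaps between consecutive timestamps; the first diff is NaT, so drop it
gaps = idx.to_series().diff().dropna()

# Timedelta / Timedelta yields a plain float
mean_hours = gaps.mean() / pd.Timedelta(hours=1)
median_hours = gaps.median() / pd.Timedelta(hours=1)

print(mean_hours)    # above 24, pulled up by the weekend gaps
print(median_hours)  # 24.0
```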
Answered by alko, Oct 31 '22 15:10