Cross-correlation (time-lag-correlation) with pandas?

Tags:

python

pandas

numpy

correlation

cross-correlation

I have various time series that I want to correlate, or rather cross-correlate, with each other, to find out at which time lag the correlation coefficient is the greatest.

I found various questions and answers/links discussing how to do it with numpy, but those would mean that I have to turn my dataframes into numpy arrays. And since my time series often cover different periods, I am afraid that I will run into chaos.

Edit

The issue I am having with all the numpy/scipy methods is that they seem to lack awareness of the time-series nature of my data. When I correlate a time series that starts in, say, 1940 with one that starts in 1970, pandas corr knows this, whereas np.correlate just produces an array of 1020 entries (the length of the longer series) full of nan.

The various Q's on this subject indicate that there should be a way to solve the different-length issue, but so far I have seen no indication of how to use it for specific time periods. I just need to shift by up to 12 months in increments of 1, to see the time of maximum correlation within one year.

Edit2

Some minimal sample data:

    import pandas as pd
    import numpy as np

    dfdates1 = pd.date_range('01/01/1980', '01/01/2000', freq='MS')
    # My real data is from measurements, but random between -3 and 3 is fitting
    dfdata1 = (np.random.random_integers(-30, 30, (len(dfdates1))) / 10.0)
    df1 = pd.DataFrame(dfdata1, index=dfdates1)

    dfdates2 = pd.date_range('03/01/1990', '02/01/2013', freq='MS')
    dfdata2 = (np.random.random_integers(-30, 30, (len(dfdates2))) / 10.0)
    df2 = pd.DataFrame(dfdata2, index=dfdates2)

Due to various processing steps, those dfs end up as dataframes that are indexed from 1940 to 2015. This should reproduce that:

    bigdates = pd.date_range('01/01/1940', '01/01/2015', freq='MS')
    big1 = pd.DataFrame(index=bigdates)
    big2 = pd.DataFrame(index=bigdates)
    big1 = pd.concat([big1, df1], axis=1)
    big2 = pd.concat([big2, df2], axis=1)

This is what I get when I correlate with pandas and shift one dataset:

    In [451]: corr_coeff_0 = big1[0].corr(big2[0])

    In [452]: corr_coeff_0
    Out[452]: 0.030543266378853299

    In [453]: big2_shift = big2.shift(1)

    In [454]: corr_coeff_1 = big1[0].corr(big2_shift[0])

    In [455]: corr_coeff_1
    Out[455]: 0.020788314779320523

And trying scipy:

    In [456]: scicorr = scipy.signal.correlate(big1,big2,mode="full")

    In [457]: scicorr
    Out[457]:
    array([[ nan],
           [ nan],
           [ nan],
           ...,
           [ nan],
           [ nan],
           [ nan]])

which according to whos is

    scicorr    ndarray    1801x1: 1801 elems, type `float64`, 14408 bytes

But I'd just like to have 12 entries. /Edit2

The idea I have come up with, is to implement a time-lag-correlation myself, like so:

    corr_coeff_0 = df1['Data'].corr(df2['Data'])

    df1_1month = df1.shift(1)
    corr_coeff_1 = df1_1month['Data'].corr(df2['Data'])

    df1_6month = df1.shift(6)
    corr_coeff_6 = df1_6month['Data'].corr(df2['Data'])
    # ...and so on

But this is probably slow, and I am probably trying to reinvent the wheel here.

Edit

The above approach seems to work, and I have put it into a loop to go through all 12 months of a year, but I would still prefer a built-in method.


asked Oct 16 '15 13:10

JC_CL

2 Answers

As far as I can tell, there isn't a built-in method that does exactly what you are asking. But if you look at the source code for the pandas Series method autocorr, you can see you've got the right idea:

    def autocorr(self, lag=1):
        """
        Lag-N autocorrelation

        Parameters
        ----------
        lag : int, default 1
            Number of lags to apply before performing autocorrelation.

        Returns
        -------
        autocorr : float
        """
        return self.corr(self.shift(lag))

So a simple time-lagged cross-correlation function would be

    def crosscorr(datax, datay, lag=0):
        """ Lag-N cross correlation.

        Parameters
        ----------
        lag : int, default 0
        datax, datay : pandas.Series objects of equal length

        Returns
        ----------
        crosscorr : float
        """
        return datax.corr(datay.shift(lag))

Then if you wanted to look at the cross correlations at each month, you could do

    xcov_monthly = [crosscorr(datax, datay, lag=i) for i in range(12)]
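As a minimal sketch of how that list could be used to pick out the lag of maximum correlation: the helper name best_lag, the synthetic series, and the choice of 12 lags below are assumptions for illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd


def crosscorr(datax, datay, lag=0):
    """Correlation of datax with datay shifted by `lag` periods."""
    return datax.corr(datay.shift(lag))


def best_lag(datax, datay, max_lag=12):
    """Return (lag, correlation) of the highest correlation for lags 0..max_lag-1.

    Hypothetical helper: it only wraps the list comprehension in a Series
    so that the lag is kept as the index.
    """
    corrs = pd.Series(
        [crosscorr(datax, datay, lag=i) for i in range(max_lag)],
        index=range(max_lag),
    )
    return corrs.idxmax(), corrs.max()


# Synthetic monthly data (assumption): x is y delayed by 3 months plus small noise.
rng = np.random.default_rng(0)
dates = pd.date_range("1980-01-01", periods=240, freq="MS")
y = pd.Series(rng.normal(size=len(dates)), index=dates)
x = y.shift(3) + 0.01 * rng.normal(size=len(dates))

lag, corr = best_lag(x, y)
print(lag, round(corr, 3))
```

With this sign convention a positive lag moves datay forward in time, so a peak at lag k suggests that datax follows datay by k periods. Because Series.corr skips pairs where either value is NaN, the NaN values introduced by shift and by non-overlapping date ranges are ignored. Negative lags could be scanned the same way by passing a different range.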

answered Oct 04 '22 18:10

Daniel Watkins

There is a better approach: you can create a function that shifts your dataframe first, before calling corr().

Take this dataframe as an example:

    d = {'prcp': [0.1, 0.2, 0.3, 0.0], 'stp': [0.0, 0.1, 0.2, 0.3]}
    df = pd.DataFrame(data=d)

    >>> df
       prcp  stp
    0   0.1  0.0
    1   0.2  0.1
    2   0.3  0.2
    3   0.0  0.3

A function to shift the other columns (all except the target):

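    def df_shifted(df, target=None, lag=0):
        # Nothing to shift: return the dataframe unchanged
        if not lag and not target:
            return df

        new = {}
        for c in df.columns:
            if c == target:
                # Keep the target column as it is
                new[c] = df[target]
            else:
                # Shift every other column by `lag` periods
                new[c] = df[c].shift(periods=lag)
        return pd.DataFrame(data=new)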

Suppose that your goal is to compare prcp (the precipitation variable) with stp (atmospheric pressure).

Without any shift, the correlation is:

    >>> df.corr()
          prcp  stp
    prcp   1.0 -0.2
    stp   -0.2  1.0

But if you shift all other columns by one period and keep the target (prcp) unchanged:

    df_new = df_shifted(df, 'prcp', lag=-1)

    >>> print(df_new)
       prcp  stp
    0   0.1  0.1
    1   0.2  0.2
    2   0.3  0.3
    3   0.0  NaN

Note that the stp column is now shifted up by one period, so if you call corr(), the result is:

    >>> df_new.corr()
          prcp  stp
    prcp   1.0  1.0
    stp    1.0  1.0

So you can do the same with a lag of -1, -2, ..., -n.
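A rough sketch of how this shift-then-correlate idea could be looped over several lags at once, reusing the example dataframe from this answer. The helper name lagged_corr and the list of lags are assumptions made for illustration.

```python
import pandas as pd


def lagged_corr(df, target, lags):
    """Correlate `target` with every other column shifted by each lag.

    Hypothetical helper: returns a DataFrame indexed by lag, with one
    column per non-target column.
    """
    others = [c for c in df.columns if c != target]
    rows = {
        lag: {c: df[target].corr(df[c].shift(lag)) for c in others}
        for lag in lags
    }
    return pd.DataFrame.from_dict(rows, orient="index")


df = pd.DataFrame({"prcp": [0.1, 0.2, 0.3, 0.0], "stp": [0.0, 0.1, 0.2, 0.3]})
table = lagged_corr(df, "prcp", lags=[0, -1])
print(table)
```

Each row of the resulting table corresponds to one lag, so table["stp"].idxmax() would give the lag at which stp correlates most strongly with prcp. With only four rows of data the numbers are purely illustrative.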

answered Oct 04 '22 17:10

Andre Araujo
