How to get the correlation between two timeseries using Pandas

Tags: python, pandas, statistics, correlation

I have two sets of temperature data, which have readings at regular (but different) time intervals. I'm trying to get the correlation between these two sets of data.

I've been playing with Pandas to try to do this. I've created two time series and am using TimeSeriesA.corr(TimeSeriesB). However, if the times in the two time series do not match up exactly (they're generally off by seconds), I get Null as an answer. I could get a decent answer if I could:

a) Interpolate/fill missing times in each TimeSeries (I know this is possible in Pandas, I just don't know how to do it)

b) Strip the seconds out of Python datetime objects (set seconds to 00, without changing minutes). I'd lose a degree of accuracy, but not a huge amount

c) Use something else in Pandas to get the correlation between two timeSeries

d) Use something in Python to get the correlation between two lists of floats, each float having a corresponding datetime object, taking into account the time.

Anyone have any suggestions?


asked Jun 24 '11 12:06

user814005

1 Answer

You have a number of options using pandas, but you have to make a decision about how it makes sense to align the data given that they don't occur at the same instants.

Use the values "as of" the times in one of the time series. Here's an example:

    In [15]: ts
    Out[15]: 
    2000-01-03 00:00:00    -0.722808451504
    2000-01-04 00:00:00    0.0125041039477
    2000-01-05 00:00:00    0.777515530539
    2000-01-06 00:00:00    -0.35714026263
    2000-01-07 00:00:00    -1.55213541118
    2000-01-10 00:00:00    -0.508166334892
    2000-01-11 00:00:00    0.58016097981
    2000-01-12 00:00:00    1.50766289013
    2000-01-13 00:00:00    -1.11114968643
    2000-01-14 00:00:00    0.259320239297



    In [16]: ts2
    Out[16]: 
    2000-01-03 00:00:30    1.05595278907
    2000-01-04 00:00:30    -0.568961755792
    2000-01-05 00:00:30    0.660511172645
    2000-01-06 00:00:30    -0.0327384421979
    2000-01-07 00:00:30    0.158094407533
    2000-01-10 00:00:30    -0.321679671377
    2000-01-11 00:00:30    0.977286027619
    2000-01-12 00:00:30    -0.603541295894
    2000-01-13 00:00:30    1.15993249209
    2000-01-14 00:00:30    -0.229379534767

You can see these are off by 30 seconds. The reindex function lets you align the data while filling values forward (getting the "as of" value):

    In [17]: ts.reindex(ts2.index, method='pad')
    Out[17]: 
    2000-01-03 00:00:30    -0.722808451504
    2000-01-04 00:00:30    0.0125041039477
    2000-01-05 00:00:30    0.777515530539
    2000-01-06 00:00:30    -0.35714026263
    2000-01-07 00:00:30    -1.55213541118
    2000-01-10 00:00:30    -0.508166334892
    2000-01-11 00:00:30    0.58016097981
    2000-01-12 00:00:30    1.50766289013
    2000-01-13 00:00:30    -1.11114968643
    2000-01-14 00:00:30    0.259320239297

    In [18]: ts2.corr(ts.reindex(ts2.index, method='pad'))
    Out[18]: -0.31004148593302283

Note that 'ffill' is an alias for 'pad' (but only in the very latest version of pandas on GitHub as of this time!).
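As a self-contained sketch of the "as of" approach above: the data here is randomly generated and purely hypothetical, and the example assumes a reasonably recent pandas where `method="ffill"` is accepted by `reindex`. It also shows why the direct correlation comes back empty: the two indexes share no timestamps.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten business days, the second series offset by 30 seconds.
rng = np.random.default_rng(0)
idx = pd.bdate_range("2000-01-03", periods=10)
ts = pd.Series(rng.standard_normal(10), index=idx)
ts2 = pd.Series(rng.standard_normal(10), index=idx + pd.Timedelta(seconds=30))

# No timestamps are shared, so a direct correlation has nothing to align on.
direct = ts.corr(ts2)  # NaN

# Take the most recent ts value "as of" each ts2 timestamp, then correlate.
aligned = ts.reindex(ts2.index, method="ffill")
as_of_corr = ts2.corr(aligned)
```

Forward filling only looks backwards in time, so the first timestamps of `ts2` would come out as NaN if they preceded every timestamp in `ts`.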

Strip seconds out of all your datetimes. The best way to do this is to use rename:

    In [25]: ts2.rename(lambda date: date.replace(second=0))
    Out[25]: 
    2000-01-03 00:00:00    1.05595278907
    2000-01-04 00:00:00    -0.568961755792
    2000-01-05 00:00:00    0.660511172645
    2000-01-06 00:00:00    -0.0327384421979
    2000-01-07 00:00:00    0.158094407533
    2000-01-10 00:00:00    -0.321679671377
    2000-01-11 00:00:00    0.977286027619
    2000-01-12 00:00:00    -0.603541295894
    2000-01-13 00:00:00    1.15993249209
    2000-01-14 00:00:00    -0.229379534767

Note that if rename produces duplicate dates, an exception will be thrown.
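A possible alternative for stripping seconds, assuming a pandas version that provides `DatetimeIndex.floor` (it is not mentioned in the answer itself), is to truncate the whole index at once. The values below are made up for illustration. Whether duplicate labels raise an error may depend on the pandas version, so this sketch checks for collisions explicitly.

```python
import pandas as pd

# Hypothetical series whose timestamps carry a 30-second offset.
idx = pd.to_datetime(
    ["2000-01-03 00:00:30", "2000-01-04 00:00:30", "2000-01-05 00:00:30"]
)
ts2 = pd.Series([1.0, -0.5, 0.7], index=idx)

# Truncate each timestamp to the start of its minute (seconds set to 00).
floored = ts2.copy()
floored.index = ts2.index.floor("min")

# Check for collisions explicitly rather than relying on an exception.
has_duplicates = not floored.index.is_unique
```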

For something a little more advanced, suppose you wanted to correlate the mean value for each minute (where you have multiple observations per minute):

    In [31]: ts_mean = ts.groupby(lambda date: date.replace(second=0)).mean()

    In [32]: ts2_mean = ts2.groupby(lambda date: date.replace(second=0)).mean()

    In [33]: ts_mean.corr(ts2_mean)
    Out[33]: -0.31004148593302283

These last code snippets may not work if you don't have the latest code from https://github.com/wesm/pandas. If .mean() doesn't work on a GroupBy object as shown above, try .agg(np.mean)
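The per-minute averaging can be sketched end to end as follows. The readings, their 20-second and 15-second spacing, and the names `a` and `b` are all hypothetical, and grouping on `index.floor("min")` is used here as an assumed modern stand-in for the lambda above.

```python
import numpy as np
import pandas as pd

# Hypothetical readings over ten minutes: A every 20 s, B every 15 s.
rng = np.random.default_rng(1)
a = pd.Series(
    rng.standard_normal(30),
    index=pd.date_range("2000-01-03", periods=30, freq=pd.Timedelta(seconds=20)),
)
b = pd.Series(
    rng.standard_normal(40),
    index=pd.date_range(
        "2000-01-03 00:00:05", periods=40, freq=pd.Timedelta(seconds=15)
    ),
)

# Average all readings that fall within the same minute.
a_mean = a.groupby(a.index.floor("min")).mean()
b_mean = b.groupby(b.index.floor("min")).mean()

# Both results are labelled by minute, so corr can align them directly.
minute_corr = a_mean.corr(b_mean)
```

Averaging per minute trades time resolution for alignment: the two series no longer need matching timestamps, only readings within the same minutes.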

Hope this helps!


answered Sep 21 '22 14:09

Wes McKinney
