Time series correlation with pandas

I have some Particulate Matter sensors and CSVs with time series like:

Sensor A:

                     date           value
date                                     
2017-11-30 00:00:00  30/11/17 0.00     49
2017-11-30 00:02:00  30/11/17 0.02     51
2017-11-30 00:03:00  30/11/17 0.03     54
2017-11-30 00:05:00  30/11/17 0.05     57
2017-11-30 00:07:00  30/11/17 0.07     53
2017-11-30 00:08:00  30/11/17 0.08     55
2017-11-30 00:10:00  30/11/17 0.10     55
2017-11-30 00:12:00  30/11/17 0.12     58
2017-11-30 00:13:00  30/11/17 0.13     57
2017-11-30 00:15:00  30/11/17 0.15     58
....
2018-02-06 09:30:00    6/2/18 9.30     33
2018-02-06 09:32:00    6/2/18 9.32     31
2018-02-06 09:33:00    6/2/18 9.33     34
2018-02-06 09:35:00    6/2/18 9.35     32
2018-02-06 09:37:00    6/2/18 9.37     33
2018-02-06 09:38:00    6/2/18 9.38     30

I set date as index with:

df.index = pd.to_datetime(df['date'], format='%d/%m/%y %H.%M')

I would like to correlate different time windows of data from the same sensor, and similar time windows across different sensors. The goal is to find out whether the values show the same increase/decrease behaviour during certain parts of the day, or across days. After setting the date index I'm able to get all PM values from 9am to 10am of every day from sensor A:

df.between_time('9:00','10:00')
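Combining this with partial string indexing on the DatetimeIndex should also give a single day's window (a sketch; the date string is just an example):

# Hypothetical example: the 9-10am window of one specific day
df.loc['2018-01-01'].between_time('9:00', '10:00')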

1) Problem 1: how to check correlation for the same sensor on different days. I filtered the 9-10am data from two days into two DataFrames, but the readings are not always taken at exactly the same minute. I may have situations like this:

01-01-2018 (df01 - I removed the date column)
2018-01-01 09:05:00     11
2018-01-01 09:07:00     11
2018-01-01 09:09:00     10
....


02-01-2018 (df02)
2018-02-01 09:05:00     67
2018-02-01 09:07:00     68
2018-02-01 09:08:00     67
....

Should I rename the data column? What I actually care about is that the third value from 01/01/2018 is correlated with the third value from the second window.

df01.corr(df02)

raises

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
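What I want is something like correlating the two windows by row position, ignoring the exact timestamps (a sketch, assuming both windows have the same number of rows and a 'value' column):

# Rebuild each window as a plain Series so corr() aligns by position
pd.Series(df01['value'].values).corr(pd.Series(df02['value'].values))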

2) Problem 2: correlating between different sensors. In this case I have two CSV files with PM values from two sensors. As in Problem 1, I would like to correlate the same time windows from both. Here too I expect some occasional lag between readings, but errors of a few minutes are fine; I just want to compare values 'at the right position'. Example:

Sensor A:
                         date           value
    date                                     
    2017-11-30 00:00:00  30/11/17 0.00     49
    2017-11-30 00:02:00  30/11/17 0.02     51
    2017-11-30 00:03:00  30/11/17 0.03     54
    2017-11-30 00:05:00  30/11/17 0.05     57

Sensor B:
                         date           value
    date                                     
    2017-11-30 00:00:00  30/11/17 0.00     1
    2017-11-30 00:02:00  30/11/17 0.02     40
    2017-11-30 00:04:00  30/11/17 0.04     11
    2017-11-30 00:05:00  30/11/17 0.05     57

AxB
                         date           valueA    valueB
    date                                     
    2017-11-30 00:00:00  30/11/17 0.00     49       1
    2017-11-30 00:02:00  30/11/17 0.02     51       40
    2017-11-30 00:03:00  30/11/17 0.03     54       11
    2017-11-30 00:05:00  30/11/17 0.05     57       57

Thank you in advance

1 Answer

I'll try to address both of your questions together. This looks like a job for pd.merge_asof(), which merges on nearest-matching keys rather than only on exact keys.

Example data

df1
date            value
30/11/17 0.00   51
30/11/17 0.02   53
30/11/17 0.05   65
30/11/17 0.08   58

df2
date            value
30/11/17 0.01   61
30/11/17 0.02   63
30/11/17 0.04   65
30/11/17 0.07   68

Preprocessing

import pandas as pd

# Parse the day/month/year hour.minute strings into datetimes
df1.date = pd.to_datetime(df1.date, format='%d/%m/%y %H.%M')
df2.date = pd.to_datetime(df2.date, format='%d/%m/%y %H.%M')
df1.set_index('date', inplace=True)
df2.set_index('date', inplace=True)

df1
                     value
date
2017-11-30 00:00:00     51
2017-11-30 00:02:00     53
2017-11-30 00:05:00     65
2017-11-30 00:08:00     58

df2
                     value
date
2017-11-30 00:01:00     61
2017-11-30 00:02:00     63
2017-11-30 00:04:00     65
2017-11-30 00:07:00     68

Merge DataFrames on nearest index match

# For each row of df1, take the value from the nearest-in-time row of df2
merged = pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='nearest')
merged
                         value_x  value_y
date
2017-11-30 00:00:00       51       61
2017-11-30 00:02:00       53       63
2017-11-30 00:05:00       65       65
2017-11-30 00:08:00       58       68

Correlations

Note that df.corr() doesn't accept another DataFrame as an argument, so df1.corr(df2) doesn't work. The corr method computes pairwise correlations between the columns of the DataFrame you call it on (docs).

merged.corr()
          value_x   value_y
value_x  1.000000  0.612873
value_y  0.612873  1.000000

Notes

The above usage of pd.merge_asof keeps the index of df1: each row of df1 receives its closest match from df2, with replacement, so if df2 has fewer rows than df1, the result will contain duplicate values from df2. Either way, the result has the same number of rows as df1.
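If you want to avoid pairing rows that are far apart in time, pd.merge_asof also accepts a tolerance parameter (a sketch; the 2-minute cutoff is an arbitrary choice for illustration):

# Only pair rows whose timestamps differ by at most 2 minutes;
# rows with no close-enough partner get NaN and are dropped here
merged = pd.merge_asof(df1, df2, left_index=True, right_index=True,
                       direction='nearest',
                       tolerance=pd.Timedelta('2min')).dropna()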

You mentioned that you really only care to compare rows by relative position, e.g., compare the 3rd value of df1 to the 3rd value of df2. Instead of using merge_asof, you could simply ignore the time index once you've used it to obtain time periods of interest, and access the underlying numpy arrays with df.values:

# Get a 2D array of shape (4, 1)
df1.values
array([[51],
       [53],
       [65],
       [58]])

# Get a 1D array of shape (4,)
df1.values.flatten()
array([51, 53, 65, 58])

# numpy correlation matrix (pd.np is deprecated, so import numpy directly)
import numpy as np
np.corrcoef(df1.values.flatten(), df2.values.flatten())
array([[1.        , 0.61287265],
       [0.61287265, 1.        ]])
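Note that np.corrcoef requires equal-length arrays, so if your two time windows contain different numbers of readings, one simple (if crude) option is to truncate to the shorter one (a sketch, assuming a 'value' column as above):

import numpy as np

a = df1['value'].values
b = df2['value'].values
n = min(len(a), len(b))          # align by position, drop the extra tail
np.corrcoef(a[:n], b[:n])[0, 1]  # Pearson correlation coefficient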