I have some Particulate Matter sensors and CSVs with time series like:
Sensor A:
date value
date
2017-11-30 00:00:00 30/11/17 0.00 49
2017-11-30 00:02:00 30/11/17 0.02 51
2017-11-30 00:03:00 30/11/17 0.03 54
2017-11-30 00:05:00 30/11/17 0.05 57
2017-11-30 00:07:00 30/11/17 0.07 53
2017-11-30 00:08:00 30/11/17 0.08 55
2017-11-30 00:10:00 30/11/17 0.10 55
2017-11-30 00:12:00 30/11/17 0.12 58
2017-11-30 00:13:00 30/11/17 0.13 57
2017-11-30 00:15:00 30/11/17 0.15 58
....
2018-02-06 09:30:00 6/2/18 9.30 33
2018-02-06 09:32:00 6/2/18 9.32 31
2018-02-06 09:33:00 6/2/18 9.33 34
2018-02-06 09:35:00 6/2/18 9.35 32
2018-02-06 09:37:00 6/2/18 9.37 33
2018-02-06 09:38:00 6/2/18 9.38 30
I set date as index with:
df.index = pd.to_datetime(df['date'], format='%d/%m/%y %H.%M')
I would like to correlate different time windows of data from the same sensor AND similar time windows from different sensors. I expect to find out whether there is the same increase/decrease behaviour in some parts of the day/days. After setting the date index I'm able to get "all PM values from 9am to 10am every day from sensor A":
df.between_time('9:00','10:00')
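A minimal reproduction of that setup, with made-up timestamps and values, showing how the datetime index enables the time-of-day filter:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["30/11/17 8.58", "30/11/17 9.05",
             "30/11/17 9.32", "30/11/17 10.10"],
    "value": [49, 51, 54, 57],
})
df.index = pd.to_datetime(df["date"], format="%d/%m/%y %H.%M")

# Keep only rows whose time of day falls between 09:00 and 10:00
window = df.between_time("9:00", "10:00")
print(window["value"].tolist())  # → [51, 54]
```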
1) Problem 1: How to check correlation from the same sensor but different days. I filtered the 9-10am data from two days into two DataFrames, but the readings are not always taken at exactly the same minute. I may have situations like this:
01-01-2018 (df01 - I removed data column)
2018-01-01 09:05:00 11
2018-01-01 09:07:00 11
2018-01-01 09:09:00 10
....
02-01-2018 (df02)
2018-02-01 09:05:00 67
2018-02-01 09:07:00 68
2018-02-01 09:08:00 67
....
Should I rename the data column? I actually care that the third value from 01/01/2018 correlates with the third value from the second window.
df01.corr(df02)
raises
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
2) Problem 2: Correlate between different sensors. In this case I have 2 CSV files with PM values from two sensors. As in Problem 1, I would like to correlate the same time windows from them. Even in this case I expect some "casual lag" between the data, but errors of a few minutes are fine and I just want to check the values 'at the right position'. Example:
Sensor A:
date value
date
2017-11-30 00:00:00 30/11/17 0.00 49
2017-11-30 00:02:00 30/11/17 0.02 51
2017-11-30 00:03:00 30/11/17 0.03 54
2017-11-30 00:05:00 30/11/17 0.05 57
Sensor B:
date value
date
2017-11-30 00:00:00 30/11/17 0.00 1
2017-11-30 00:02:00 30/11/17 0.02 40
2017-11-30 00:04:00 30/11/17 0.03 11
2017-11-30 00:05:00 30/11/17 0.05 57
AxB
date valueA valueB
date
2017-11-30 00:00:00 30/11/17 0.00 49 1
2017-11-30 00:02:00 30/11/17 0.02 51 40
2017-11-30 00:03:00 30/11/17 0.03 54 11
2017-11-30 00:05:00 30/11/17 0.05 57 57
Thank you in advance
I'll try to address both of your questions together. This looks like a job for pd.merge_asof(), which merges on nearest-matching keys rather than only on exact keys.
df1
date value
30/11/17 0.00 51
30/11/17 0.02 53
30/11/17 0.05 65
30/11/17 0.08 58
df2
date value
30/11/17 0.01 61
30/11/17 0.02 63
30/11/17 0.04 65
30/11/17 0.07 68
df1.date = pd.to_datetime(df1.date, format='%d/%m/%y %H.%M')
df2.date = pd.to_datetime(df2.date, format='%d/%m/%y %H.%M')
df1.set_index('date', inplace=True)
df2.set_index('date', inplace=True)
df1
value
date
2017-11-30 00:00:00 51
2017-11-30 00:02:00 53
2017-11-30 00:05:00 65
2017-11-30 00:08:00 58
df2
value
date
2017-11-30 00:01:00 61
2017-11-30 00:02:00 63
2017-11-30 00:04:00 65
2017-11-30 00:07:00 68
merged = pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='nearest')
merged
value_x value_y
date
2017-11-30 00:00:00 51 61
2017-11-30 00:02:00 53 63
2017-11-30 00:05:00 65 65
2017-11-30 00:08:00 58 68
Note that df.corr() doesn't accept another DataFrame as an argument, which is why df1.corr(df2) raises the ValueError you saw. The corr method computes pairwise correlation of the columns within the DataFrame you call it on (docs).
merged.corr()
value_x value_y
value_x 1.000000 0.612873
value_y 0.612873 1.000000
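If you only need the single coefficient rather than the full matrix, you can call Series.corr on the two merged columns. A sketch using the same toy values as the merged frame above:

```python
import pandas as pd

# The merged frame from above, rebuilt by hand for a self-contained example
merged = pd.DataFrame(
    {"value_x": [51, 53, 65, 58], "value_y": [61, 63, 65, 68]}
)

# Pearson correlation between the two sensors' aligned values
r = merged["value_x"].corr(merged["value_y"])
print(round(r, 6))  # → 0.612873
```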
The above usage of pd.merge_asof keeps the index of df1: each row of df1 receives its closest match in df2, with replacement, so if df2 ever has fewer rows than df1, the result of merge_asof will contain duplicate values from df2. The result always has the same number of rows as df1.
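If you want to avoid pairing rows that are too far apart in time, merge_asof also accepts a tolerance parameter; rows with no match within the tolerance get NaN instead of a distant value. A sketch with made-up timestamps (tolerance of 30 seconds chosen just to make some rows miss):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"value": [51, 53, 65, 58]},
    index=pd.to_datetime(["2017-11-30 00:00", "2017-11-30 00:02",
                          "2017-11-30 00:05", "2017-11-30 00:08"]),
)
df2 = pd.DataFrame(
    {"value": [61, 63, 65, 68]},
    index=pd.to_datetime(["2017-11-30 00:01", "2017-11-30 00:02",
                          "2017-11-30 00:04", "2017-11-30 00:07"]),
)

# Only accept matches at most 30 seconds away; all others become NaN
merged = pd.merge_asof(
    df1, df2, left_index=True, right_index=True,
    direction="nearest", tolerance=pd.Timedelta("30s"),
)
print(merged)
```

Here only the exact 00:02 match survives; the other three rows of df2 are a full minute away and get NaN, which corr() then ignores.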
You mentioned that you really only care to compare rows by relative position, e.g., compare the 3rd value of df1 to the 3rd value of df2. Instead of using merge_asof, you could simply ignore the time index once you've used it to obtain the time periods of interest, and access the underlying numpy arrays with df.values:
# Get a 2D array of shape (4, 1)
df1.values
array([[51],
[53],
[65],
[58]])
# Get a 1D array of shape (4,)
df1.values.flatten()
array([51, 53, 65, 58])
# numpy correlation matrix (pd.np is deprecated; import numpy directly)
import numpy as np
np.corrcoef(df1.values.flatten(), df2.values.flatten())
array([[1. , 0.61287265],
[0.61287265, 1. ]])
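Equivalently, staying in pandas, you can drop the datetime indexes (e.g. with reset_index(drop=True)) so that both Series carry a plain positional index, and let Series.corr align the i-th value with the i-th value. A sketch with the same toy values:

```python
import pandas as pd

# Stand-ins for df1['value'].reset_index(drop=True) etc.
s1 = pd.Series([51, 53, 65, 58])
s2 = pd.Series([61, 63, 65, 68])

# With plain RangeIndexes, corr() compares values purely by position
r = s1.corr(s2)
print(round(r, 6))  # → 0.612873
```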