I want to merge two dataframes on three columns: email, subject, and timestamp. The timestamps in the two dataframes differ, so I need to identify the closest matching timestamp within each group of email and subject.
Below is a reproducible example using a closest-match function that was suggested for this kind of question.
import numpy as np
import pandas as pd
from io import StringIO

def find_closest_date(timepoint, time_series, add_time_delta_column=True):
    # takes a pd.Timestamp() instance and a pd.Series of dates
    # computes the delta between `timepoint` and each date in `time_series`
    # returns the closest date and optionally the time delta to it
    deltas = np.abs(time_series - timepoint)
    idx_closest_date = np.argmin(deltas.values)  # positional index of the smallest delta
    res = {"closest_date": time_series.iloc[idx_closest_date]}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = deltas.iloc[idx_closest_date]
        idx.append('closest_delta')
    return pd.Series(res, index=idx)
a = """timestamp,email,subject
2016-07-01 10:17:00,[email protected],subject3
2016-07-01 02:01:02,[email protected],welcome
2016-07-01 14:45:04,[email protected],subject3
2016-07-01 08:14:02,[email protected],subject2
2016-07-01 16:26:35,[email protected],subject4
2016-07-01 10:17:00,[email protected],subject3
2016-07-01 02:01:02,[email protected],welcome
2016-07-01 14:45:04,[email protected],subject3
2016-07-01 08:14:02,[email protected],subject2
2016-07-01 16:26:35,[email protected],subject4
"""
b = """timestamp,email,subject,clicks,var1
2016-07-01 02:01:14,[email protected],welcome,1,1
2016-07-01 08:15:48,[email protected],subject2,2,2
2016-07-01 10:17:39,[email protected],subject3,1,7
2016-07-01 14:46:01,[email protected],subject3,1,2
2016-07-01 16:27:28,[email protected],subject4,1,2
2016-07-01 10:17:05,[email protected],subject3,0,0
2016-07-01 02:01:03,[email protected],welcome,0,0
2016-07-01 14:45:05,[email protected],subject3,0,0
2016-07-01 08:16:00,[email protected],subject2,0,0
2016-07-01 17:00:00,[email protected],subject4,0,0
"""
Notice that for [email protected] the closest matching timestamp in b is 10:17:39, whereas for [email protected] it is 10:17:05. These are the relevant rows:

from a:
2016-07-01 10:17:00,[email protected],subject3
2016-07-01 10:17:00,[email protected],subject3

from b:
2016-07-01 10:17:39,[email protected],subject3,1,7
2016-07-01 10:17:05,[email protected],subject3,0,0
df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
df2 = pd.read_csv(StringIO(b), parse_dates=['timestamp'])
df1[['closest', 'time_bt_x_and_y']] = df1.timestamp.apply(find_closest_date, args=[df2.timestamp])
df1
df3 = pd.merge(df1, df2,
               left_on=['email', 'subject', 'closest'],
               right_on=['email', 'subject', 'timestamp'],
               how='left')
df3
           timestamp_x              email   subject             closest time_bt_x_and_y         timestamp_y  clicks  var1
0  2016-07-01 10:17:00  [email protected]  subject3 2016-07-01 10:17:05        00:00:05                 NaT     NaN   NaN
1  2016-07-01 02:01:02  [email protected]   welcome 2016-07-01 02:01:03        00:00:01                 NaT     NaN   NaN
2  2016-07-01 14:45:04  [email protected]  subject3 2016-07-01 14:45:05        00:00:01                 NaT     NaN   NaN
3  2016-07-01 08:14:02  [email protected]  subject2 2016-07-01 08:15:48        00:01:46 2016-07-01 08:15:48     2.0   2.0
4  2016-07-01 16:26:35  [email protected]  subject4 2016-07-01 16:27:28        00:00:53 2016-07-01 16:27:28     1.0   2.0
5  2016-07-01 10:17:00  [email protected]  subject3 2016-07-01 10:17:05        00:00:05 2016-07-01 10:17:05     0.0   0.0
6  2016-07-01 02:01:02  [email protected]   welcome 2016-07-01 02:01:03        00:00:01 2016-07-01 02:01:03     0.0   0.0
7  2016-07-01 14:45:04  [email protected]  subject3 2016-07-01 14:45:05        00:00:01 2016-07-01 14:45:05     0.0   0.0
8  2016-07-01 08:14:02  [email protected]  subject2 2016-07-01 08:15:48        00:01:46                 NaT     NaN   NaN
9  2016-07-01 16:26:35  [email protected]  subject4 2016-07-01 16:27:28        00:00:53                 NaT     NaN   NaN
The result is wrong, mainly because the closest date is incorrect: it is computed over all of df2 without taking email & subject into account. The expected result is for each row of df1 to pick up clicks and var1 from the df2 row with the nearest timestamp for the same email and subject.

Amending the function to give the closest timestamp for a given email and subject would be helpful:
df1.groupby(['email','subject'])['timestamp'].apply(find_closest_date, args=[df1.timestamp])
But that gives an error, since the function is not defined for a group object. What's the best way of doing this?
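For reference, one way the amendment could look is a per-group wrapper around find_closest_date. This is only a sketch, not from the original post: the helper name find_closest_by_group is invented here, and it assumes every (email, subject) pair in df1 also appears in df2 (an empty candidate set would raise an error).

def find_closest_by_group(group):
    # group.name holds this group's (email, subject) key
    email, subject = group.name
    # search only the df2 timestamps that share this email & subject
    # (assumes at least one such row exists in df2)
    candidates = df2.loc[(df2.email == email) & (df2.subject == subject),
                         'timestamp'].reset_index(drop=True)
    return group.apply(find_closest_date, args=[candidates])

res = (df1.groupby(['email', 'subject'], group_keys=False)['timestamp']
          .apply(find_closest_by_group))
df1['closest'] = res['closest_date']
df1['time_bt_x_and_y'] = res['closest_delta']

With closest computed per group this way, the pd.merge on ['email', 'subject', 'closest'] above would line up. The answer below takes a different, merge-based route.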
Notice that if you merge df1 and df2 on email and subject, then the result
has all the possible relevant timestamp pairings:
In [108]: result = pd.merge(df1, df2, how='left', on=['email','subject'], suffixes=['', '_y']); result
Out[108]:
              timestamp              email   subject         timestamp_y  clicks  var1
0   2016-07-01 10:17:00  [email protected]  subject3 2016-07-01 10:17:39       1     7
1   2016-07-01 10:17:00  [email protected]  subject3 2016-07-01 14:46:01       1     2
2   2016-07-01 02:01:02  [email protected]   welcome 2016-07-01 02:01:14       1     1
3   2016-07-01 14:45:04  [email protected]  subject3 2016-07-01 10:17:39       1     7
4   2016-07-01 14:45:04  [email protected]  subject3 2016-07-01 14:46:01       1     2
5   2016-07-01 08:14:02  [email protected]  subject2 2016-07-01 08:15:48       2     2
6   2016-07-01 16:26:35  [email protected]  subject4 2016-07-01 16:27:28       1     2
7   2016-07-01 10:17:00  [email protected]  subject3 2016-07-01 10:17:05       0     0
8   2016-07-01 10:17:00  [email protected]  subject3 2016-07-01 14:45:05       0     0
9   2016-07-01 02:01:02  [email protected]   welcome 2016-07-01 02:01:03       0     0
10  2016-07-01 14:45:04  [email protected]  subject3 2016-07-01 10:17:05       0     0
11  2016-07-01 14:45:04  [email protected]  subject3 2016-07-01 14:45:05       0     0
12  2016-07-01 08:14:02  [email protected]  subject2 2016-07-01 08:16:00       0     0
13  2016-07-01 16:26:35  [email protected]  subject4 2016-07-01 17:00:00       0     0
You could now take the absolute value of the difference in timestamps for each row:
result['diff'] = (result['timestamp_y'] - result['timestamp']).abs()
and then use
idx = result.groupby(['timestamp','email','subject'])['diff'].idxmin()
result = result.loc[idx]
to find the rows with the minimum difference for each group based on ['timestamp','email','subject'].
import numpy as np
import pandas as pd
from io import StringIO
a = """timestamp,email,subject
2016-07-01 10:17:00,[email protected],subject3
2016-07-01 02:01:02,[email protected],welcome
2016-07-01 14:45:04,[email protected],subject3
2016-07-01 08:14:02,[email protected],subject2
2016-07-01 16:26:35,[email protected],subject4
2016-07-01 10:17:00,[email protected],subject3
2016-07-01 02:01:02,[email protected],welcome
2016-07-01 14:45:04,[email protected],subject3
2016-07-01 08:14:02,[email protected],subject2
2016-07-01 16:26:35,[email protected],subject4
"""
b = """timestamp,email,subject,clicks,var1
2016-07-01 02:01:14,[email protected],welcome,1,1
2016-07-01 08:15:48,[email protected],subject2,2,2
2016-07-01 10:17:39,[email protected],subject3,1,7
2016-07-01 14:46:01,[email protected],subject3,1,2
2016-07-01 16:27:28,[email protected],subject4,1,2
2016-07-01 10:17:05,[email protected],subject3,0,0
2016-07-01 02:01:03,[email protected],welcome,0,0
2016-07-01 14:45:05,[email protected],subject3,0,0
2016-07-01 08:16:00,[email protected],subject2,0,0
2016-07-01 17:00:00,[email protected],subject4,0,0
"""
df1 = pd.read_csv(StringIO(a), parse_dates=['timestamp'])
df2 = pd.read_csv(StringIO(b), parse_dates=['timestamp'])
result = pd.merge(df1, df2, how='left', on=['email','subject'], suffixes=['', '_y'])
result['diff'] = (result['timestamp_y'] - result['timestamp']).abs()
idx = result.groupby(['timestamp','email','subject'])['diff'].idxmin()
result = result.loc[idx].drop(['timestamp_y','diff'], axis=1)
result = result.sort_index()
print(result)
yields
              timestamp              email   subject  clicks  var1
0   2016-07-01 10:17:00  [email protected]  subject3       1     7
2   2016-07-01 02:01:02  [email protected]   welcome       1     1
4   2016-07-01 14:45:04  [email protected]  subject3       1     2
5   2016-07-01 08:14:02  [email protected]  subject2       2     2
6   2016-07-01 16:26:35  [email protected]  subject4       1     2
7   2016-07-01 10:17:00  [email protected]  subject3       0     0
9   2016-07-01 02:01:02  [email protected]   welcome       0     0
11  2016-07-01 14:45:04  [email protected]  subject3       0     0
12  2016-07-01 08:14:02  [email protected]  subject2       0     0
13  2016-07-01 16:26:35  [email protected]  subject4       0     0
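For what it's worth, the same nearest-match-per-group logic can also be expressed directly with pd.merge_asof (direction='nearest' requires pandas 0.20+). A minimal sketch, reusing df1 and df2 from above; note that merge_asof needs both frames sorted on the join key:

# merge_asof requires both frames to be sorted on the 'on' key
left = df1.sort_values('timestamp')
right = df2.sort_values('timestamp')

# for each df1 row, take the df2 row with the nearest timestamp
# among the df2 rows sharing the same email and subject
nearest = pd.merge_asof(left, right, on='timestamp',
                        by=['email', 'subject'], direction='nearest')
print(nearest)

Unlike the merge/groupby/idxmin route, the output keeps df1's own timestamps, and tie-breaking between equally distant df2 rows may differ.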