I am trying to merge two dataframes on both name and the closest date (WRT the left hand dataframe). In my research I found one similar question here but it doesn't account for the name as well. From the above question it doesn't seem like there is a way to do this with merge but I can't see another way to do the two argument join that doesn't use the pandas merge function.
Is there a way to do this with merge? And if not what would be the appropriate way to do this?
I will post a copy of what I have tried but this was trying it with an exact merge on date which will not work. The most important line is the last one where I make the data3 dataframe.
data=pd.read_csv("edgar14Afacts.csv", parse_dates={"dater": [2]}, infer_datetime_format=True)
data2=pd.read_csv("sdcmergersdata.csv", parse_dates={"dater": [17]}, infer_datetime_format=True)
list(data2.columns.values)
data2.rename(columns=lambda x: x.replace('\r\n', ''), inplace=True)
data2.rename(columns=lambda x: x.replace('\n', ''), inplace=True)
data2.rename(columns=lambda x: x.replace('\r', ''), inplace=True)
data2=data2.rename(columns = {'Acquiror Name':'name'})
data2=data2.rename(columns = {'dater':'date'})
data=data.rename(columns = {'dater':'date'})
list(data2.columns.values)
data["name"]=data['name'].map(str.lower)
data2["name"]=data2['name'].map(str.lower)
data2['date'].fillna(method='pad')
data['namer1']=data['name']
data['dater1']=data['date']
data2['namer2']=data2['name']
data2['dater2']=data2['date']
print data.head()
print data2.head()
data['name'] = data['name'].map(lambda x: str(x)[:4])
data2['name'] = data2['name'].map(lambda x: str(x)[:4])
data3 = pd.merge(data, data2, how='left', on=['date','name'])
data3.to_csv("check.csv")
Perform an asof merge. This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.
The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other. The merge() function is equivalent to the SQL JOIN clause. 'left', 'right' and 'inner' joins are all possible.
Both join and merge can be used to combines two dataframes but the join method combines two dataframes on the basis of their indexes whereas the merge method is more versatile and allows us to specify columns beside the index to join on for both dataframes.
join(df2) instead of merge , it's much faster. It is a left merge/join.
This is super late, but hopefully its helpful for new answer seekers. I answered a similar question here
with a somewhat new method in pandas:
pandas.merge_asof()
The parameters of interest for you would be direction
,tolerance
,left_on
, and right_on
Building off @hernamesbarbara answer & data:
a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""
b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""
import pandas as pd
from pandas import read_csv
from io import StringIO
df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df1['timepoint'] = pd.to_datetime(df1['timepoint'])
df2['timepoint'] = pd.to_datetime(df2['timepoint'])
# converting this to the index so we can preserve the date_start_time columns so you can validate the merging logic
df1.index = df1['timepoint']
df2.index = df2['timepoint']
# the magic happens below, check the direction and tolerance arguments
# if you want you can make a maximum tolerance on which to merge data
tol = pd.Timedelta('3 day')
df3 = pd.merge_asof(left=df1,right=df2,right_index=True,left_index=True,direction='nearest',tolerance=tol)
df3.head()
timepoint_x measure_x timepoint_y measure_y
timepoint
2014-01-01 2014-01-01 78 2014-01-01 78
2014-01-02 2014-01-02 29 2014-01-01 78
2014-01-03 2014-01-03 5 2014-01-01 78
2014-01-04 2014-01-04 73 2014-01-01 78
2014-01-05 2014-01-05 40 2014-01-08 29
I'd also love to see the final solution you came up with to know how it shook out in the end.
One thing you can do to find the closest date might be something to calc the number of days between each date in the first DataFrame and the dates in the second DataFrame. Then you can use np.argmin
to retrieve the date with the smallest time delta.
For example:
Setup
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from pandas.io.parsers import StringIO
Data
a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""
b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""
look at data
df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df1.head()
timepoint measure
0 2014-01-01 78
1 2014-01-02 29
2 2014-01-03 5
3 2014-01-04 73
4 2014-01-05 40
df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df2.head()
timepoint measure
0 2014-01-01 78
1 2014-01-08 29
2 2014-01-15 5
3 2014-01-22 73
4 2014-01-29 40
Func to find the closest date to a given date
def find_closest_date(timepoint, time_series, add_time_delta_column=True):
# takes a pd.Timestamp() instance and a pd.Series with dates in it
# calcs the delta between `timepoint` and each date in `time_series`
# returns the closest date and optionally the number of days in its time delta
deltas = np.abs(time_series - timepoint)
idx_closest_date = np.argmin(deltas)
res = {"closest_date": time_series.ix[idx_closest_date]}
idx = ['closest_date']
if add_time_delta_column:
res["closest_delta"] = deltas[idx_closest_date]
idx.append('closest_delta')
return pd.Series(res, index=idx)
df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply(
find_closest_date, args=[df2.timepoint])
df1.head(10)
timepoint measure closest days_bt_x_and_y
0 2014-01-01 78 2014-01-01 0 days
1 2014-01-02 29 2014-01-01 1 days
2 2014-01-03 5 2014-01-01 2 days
3 2014-01-04 73 2014-01-01 3 days
4 2014-01-05 40 2014-01-08 3 days
5 2014-01-06 45 2014-01-08 2 days
6 2014-01-07 48 2014-01-08 1 days
7 2014-01-08 2 2014-01-08 0 days
8 2014-01-09 96 2014-01-08 1 days
9 2014-01-10 82 2014-01-08 2 days
Merge the two DataFrames on the new closest
date column
df3 = pd.merge(df1, df2, left_on=['closest'], right_on=['timepoint'])
colorder = [
'timepoint_x',
'closest',
'timepoint_y',
'days_bt_x_and_y',
'measure_x',
'measure_y'
]
df3 = df3.ix[:, colorder]
df3
timepoint_x closest timepoint_y days_bt_x_and_y measure_x measure_y
0 2014-01-01 2014-01-01 2014-01-01 0 days 78 78
1 2014-01-02 2014-01-01 2014-01-01 1 days 29 78
2 2014-01-03 2014-01-01 2014-01-01 2 days 5 78
3 2014-01-04 2014-01-01 2014-01-01 3 days 73 78
4 2014-01-05 2014-01-08 2014-01-08 3 days 40 29
5 2014-01-06 2014-01-08 2014-01-08 2 days 45 29
6 2014-01-07 2014-01-08 2014-01-08 1 days 48 29
7 2014-01-08 2014-01-08 2014-01-08 0 days 2 29
8 2014-01-09 2014-01-08 2014-01-08 1 days 96 29
9 2014-01-10 2014-01-08 2014-01-08 2 days 82 29
10 2014-01-11 2014-01-08 2014-01-08 3 days 61 29
11 2014-01-12 2014-01-15 2014-01-15 3 days 68 5
12 2014-01-13 2014-01-15 2014-01-15 2 days 8 5
13 2014-01-14 2014-01-15 2014-01-15 1 days 94 5
14 2014-01-15 2014-01-15 2014-01-15 0 days 16 5
15 2014-01-16 2014-01-15 2014-01-15 1 days 31 5
16 2014-01-17 2014-01-15 2014-01-15 2 days 10 5
17 2014-01-18 2014-01-15 2014-01-15 3 days 34 5
18 2014-01-19 2014-01-22 2014-01-22 3 days 27 73
19 2014-01-20 2014-01-22 2014-01-22 2 days 58 73
20 2014-01-21 2014-01-22 2014-01-22 1 days 90 73
21 2014-01-22 2014-01-22 2014-01-22 0 days 41 73
22 2014-01-23 2014-01-22 2014-01-22 1 days 97 73
23 2014-01-24 2014-01-22 2014-01-22 2 days 7 73
24 2014-01-25 2014-01-22 2014-01-22 3 days 86 73
25 2014-01-26 2014-01-29 2014-01-29 3 days 62 40
26 2014-01-27 2014-01-29 2014-01-29 2 days 91 40
27 2014-01-28 2014-01-29 2014-01-29 1 days 0 40
28 2014-01-29 2014-01-29 2014-01-29 0 days 73 40
29 2014-01-30 2014-01-29 2014-01-29 1 days 22 40
30 2014-01-31 2014-01-29 2014-01-29 2 days 43 40
31 2014-02-01 2014-01-29 2014-01-29 3 days 87 40
32 2014-02-02 2014-02-05 2014-02-05 3 days 56 45
33 2014-02-03 2014-02-05 2014-02-05 2 days 45 45
34 2014-02-04 2014-02-05 2014-02-05 1 days 25 45
35 2014-02-05 2014-02-05 2014-02-05 0 days 92 45
36 2014-02-06 2014-02-05 2014-02-05 1 days 83 45
37 2014-02-07 2014-02-05 2014-02-05 2 days 13 45
38 2014-02-08 2014-02-05 2014-02-05 3 days 50 45
39 2014-02-09 2014-02-12 2014-02-12 3 days 48 48
40 2014-02-10 2014-02-12 2014-02-12 2 days 78 48
A small addition to the hernamesbarbara's code
def find_closest_date(timepoint, time_series, add_time_delta_column=True, mode="abs"):
"""takes a pd.Timestamp() instance and a pd.Series with dates in it
calcs the delta between `timepoint` and each date in `time_series`
returns the closest date and optionally the number of days in its time delta
Parameters
----------
mode: "abs" (default), "left", "right"
closest datetime by abs, at left, at right
References
----------
.. [1] http://stackoverflow.com/a/25962323/716469
"""
deltas = time_series - timepoint
idx_closest_date = None
if mode == "abs":
idx_closest_date = np.argmin(abs(deltas))
elif mode == "left":
deltas_ = deltas[deltas <= pd.Timedelta('0 days 00:00:00.0')]
if len(deltas_):
idx_closest_date = np.argmax(deltas_)
elif mode == "right":
deltas_ = deltas[deltas >= pd.Timedelta('0 days 00:00:00.0')]
if len(deltas_):
idx_closest_date = np.argmin(deltas_)
else:
raise Exception("Mode is incorrect")
if idx_closest_date is not None:
closest_date = time_series.ix[idx_closest_date]
if add_time_delta_column:
closest_delta = deltas[idx_closest_date]
else:
closest_date = pd.NaT
if add_time_delta_column:
closest_delta = pd.Timedelta(pd.NaT)
res = {"closest_date": closest_date}
idx = ['closest_date']
if add_time_delta_column:
res["closest_delta"] = closest_delta
idx.append('closest_delta')
return pd.Series(res, index=idx)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With