I have two data frames as follows:
A = pd.DataFrame({"ID":["A", "A", "C" ,"B", "B"], "date":["06/22/2014","07/02/2014","01/01/2015","01/01/1991","08/02/1999"]})
B = pd.DataFrame({"ID":["A", "A", "C" ,"B", "B"], "date":["02/15/2015","06/30/2014","07/02/1999","10/05/1990","06/24/2014"], "value": ["3","5","1","7","8"] })
Which look like the following:
>>> A
ID date
0 A 2014-06-22
1 A 2014-07-02
2 C 2015-01-01
3 B 1991-01-01
4 B 1999-08-02
>>> B
ID date value
0 A 2015-02-15 3
1 A 2014-06-30 5
2 C 1999-07-02 1
3 B 1990-10-05 7
4 B 2014-06-24 8
I want to merge A with the values of B using the nearest date. In this example, none of the dates match, but it could the the case that some do.
The output should be something like this:
>>> C
ID date value
0 A 06/22/2014 8
1 A 07/02/2014 5
2 C 01/01/2015 3
3 B 01/01/1991 7
4 B 08/02/1999 1
It seems to me that there should be a native function in pandas that would allow this.
Note: as similar question has been asked here pandas.merge: match the nearest time stamp >= the series of timestamps
groupby(), sum() — Using a groupby() on dates column and sum() on newCases column returns a series object with length 269. This will group all duplicate dates as one and add up their respective cases. Series to Data-Frame — Groupby() function returns a series object.
The Fastest Ways As it turns out, join always tends to perform well, and merge will perform almost exactly the same given the syntax is optimal.
We can use either pandas. merge() or DataFrame. merge() to merge multiple Dataframes. Merging multiple Dataframes is similar to SQL join and supports different types of join inner , left , right , outer , cross .
You can use reindex
with method='nearest'
and then merge
:
A['date'] = pd.to_datetime(A.date)
B['date'] = pd.to_datetime(B.date)
A.sort_values('date', inplace=True)
B.sort_values('date', inplace=True)
B1 = B.set_index('date').reindex(A.set_index('date').index, method='nearest').reset_index()
print (B1)
print (pd.merge(A,B1, on='date'))
ID_x date ID_y value
0 B 1991-01-01 B 7
1 B 1999-08-02 C 1
2 A 2014-06-22 B 8
3 A 2014-07-02 A 5
4 C 2015-01-01 A 3
You can also add parameter suffixes
:
print (pd.merge(A,B1, on='date', suffixes=('_', '')))
ID_ date ID value
0 B 1991-01-01 B 7
1 B 1999-08-02 C 1
2 A 2014-06-22 B 8
3 A 2014-07-02 A 5
4 C 2015-01-01 A 3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With