I need to merge two pandas dataframes on an identifier and a condition where a date in one dataframe is between two dates in the other dataframe.
Dataframe A has a date ("fdate") and an ID ("cusip"):
I need to merge this with this dataframe B:
on A.cusip==B.ncusip
and A.fdate
is between B.namedt
and B.nameenddt
.
In SQL this would be trivial, but the only way I can see how to do this in pandas is to first merge unconditionally on the identifier, and then filter on the date condition:
df = pd.merge(A, B, how='inner', left_on='cusip', right_on='ncusip') df = df[(df['fdate']>=df['namedt']) & (df['fdate']<=df['nameenddt'])]
Is this really the best way to do this? It seems that it would be much better if one could filter within the merge so as to avoid having a potentially very large dataframe after the merge but before the filter has completed.
merge() function to join the two data frames by inner join. Now, add a suffix called 'remove' for newly joined columns that have the same name in both data frames. Use the drop() function to remove the columns with the suffix 'remove'. This will ensure that identical columns don't exist in the new dataframe.
To concatenate DataFrames, use the concat() method, but to ignore duplicates, use the drop_duplicates() method.
As you say, this is pretty easy in SQL, so why not do it in SQL?
import pandas as pd import sqlite3 #We'll use firelynx's tables: presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"], "president_id":[43, 44, 45]}) terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'), 'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'), 'president_id': [43, 43, 44, 44, 45]}) war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)], "name": ["War in Afghanistan", "Iraq War"]}) #Make the db in memory conn = sqlite3.connect(':memory:') #write the tables terms.to_sql('terms', conn, index=False) presidents.to_sql('presidents', conn, index=False) war_declarations.to_sql('wars', conn, index=False) qry = ''' select start_date PresTermStart, end_date PresTermEnd, wars.date WarStart, presidents.name Pres from terms join wars on date between start_date and end_date join presidents on terms.president_id = presidents.president_id ''' df = pd.read_sql_query(qry, conn)
df:
PresTermStart PresTermEnd WarStart Pres 0 2001-01-31 00:00:00 2005-01-31 00:00:00 2001-09-14 00:00:00 Bush 1 2001-01-31 00:00:00 2005-01-31 00:00:00 2003-03-03 00:00:00 Bush
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With