Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Merge pandas dataframes where one value is between two others [duplicate]

I need to merge two pandas dataframes on an identifier and a condition where a date in one dataframe is between two dates in the other dataframe.

Dataframe A has a date ("fdate") and an ID ("cusip"):

enter image description here

I need to merge this with this dataframe B:

enter image description here

on A.cusip==B.ncusip and A.fdate is between B.namedt and B.nameenddt.

In SQL this would be trivial, but the only way I can see how to do this in pandas is to first merge unconditionally on the identifier, and then filter on the date condition:

df = pd.merge(A, B, how='inner', left_on='cusip', right_on='ncusip') df = df[(df['fdate']>=df['namedt']) & (df['fdate']<=df['nameenddt'])] 

Is this really the best way to do this? It seems that it would be much better if one could filter within the merge so as to avoid having a potentially very large dataframe after the merge but before the filter has completed.

like image 612
itzy Avatar asked Jun 03 '15 18:06

itzy


People also ask

How avoid duplicates in pandas merge?

merge() function to join the two data frames by inner join. Now, add a suffix called 'remove' for newly joined columns that have the same name in both data frames. Use the drop() function to remove the columns with the suffix 'remove'. This will ensure that identical columns don't exist in the new dataframe.

How do I merge two DataFrames in pandas and remove duplicates?

To concatenate DataFrames, use the concat() method, but to ignore duplicates, use the drop_duplicates() method.


1 Answers

As you say, this is pretty easy in SQL, so why not do it in SQL?

import pandas as pd import sqlite3  #We'll use firelynx's tables: presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],                            "president_id":[43, 44, 45]}) terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),                       'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),                       'president_id': [43, 43, 44, 44, 45]}) war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],                                  "name": ["War in Afghanistan", "Iraq War"]}) #Make the db in memory conn = sqlite3.connect(':memory:') #write the tables terms.to_sql('terms', conn, index=False) presidents.to_sql('presidents', conn, index=False) war_declarations.to_sql('wars', conn, index=False)  qry = '''     select           start_date PresTermStart,         end_date PresTermEnd,         wars.date WarStart,         presidents.name Pres     from         terms join wars on         date between start_date and end_date join presidents on         terms.president_id = presidents.president_id     ''' df = pd.read_sql_query(qry, conn) 

df:

         PresTermStart          PresTermEnd             WarStart  Pres 0  2001-01-31 00:00:00  2005-01-31 00:00:00  2001-09-14 00:00:00  Bush 1  2001-01-31 00:00:00  2005-01-31 00:00:00  2003-03-03 00:00:00  Bush 
like image 88
cfort Avatar answered Sep 22 '22 11:09

cfort