Join two DataFrames on common columns only if the difference in a separate column is within range [-n, +n]

I have two data frames df1 and df2 as shown below:

df1

Date        BillNo.     Amount
10/08/2020  ABBCSQ1ZA   878
10/09/2020  AADC9C1Z5   11
10/12/2020  AC928Q1ZS   3998
10/14/2020  AC9268RE3   198
10/16/2020  AA171E1Z0   5490
10/19/2020  BU073C1ZW   3432

df2

Date        BillNo.     Amount
10/08/2020  ABBCSQ1ZA   876
10/11/2020  ATRC95REW   115
10/14/2020  AC9268RE3   212
10/16/2020  AA171E1Z0   5491
10/25/2020  BPO66W2LO   344

My final answer should be:

final

Date        BillNo.     Amount
10/08/2020  ABBCSQ1ZA   876
10/16/2020  AA171E1Z0   5491

How do I find the rows common to both data frames, matching on Date and BillNo., where the difference in Amount lies within [-5, 5]?

I know how to find common rows by using:

df_all = df1.merge(df2.drop_duplicates(), on=['Date', 'BillNo.', 'Amount'], 
                   how='outer', indicator=True)

However, this only matches rows whose Amount is exactly equal, so it doesn't return the rows that are within range. Can anyone help?

Edit: For df1 (10/14/2020, AC9268RE3, 198) and df2 (10/14/2020, AC9268RE3, 212) the difference is 14, so this row should not be included in the common rows.

Gopal Chitalia asked Dec 26 '20 12:12


3 Answers

We can merge, then perform a query to drop rows not within the range:

(df1.merge(df2, on=['Date', 'BillNo.'])
    .query('abs(Amount_x - Amount_y) <= 5')
    .drop('Amount_x', axis=1))

         Date    BillNo.  Amount_y
0  10/08/2020  ABBCSQ1ZA       876
1  10/16/2020  AA171E1Z0      5491

This works well as long as there is only one row that corresponds to a specific (Date, BillNo) combination in each frame.
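
A minimal, self-contained sketch of the merge-then-filter approach, using the sample data from the question. The `_df1`/`_df2` suffixes and the variable names `n`, `merged`, `within` and `final` are choices made here for illustration. Passing `validate="one_to_one"` is one way to enforce the caveat above: pandas raises a `MergeError` if a (Date, BillNo.) pair is repeated in either frame.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["10/08/2020", "10/09/2020", "10/12/2020",
             "10/14/2020", "10/16/2020", "10/19/2020"],
    "BillNo.": ["ABBCSQ1ZA", "AADC9C1Z5", "AC928Q1ZS",
                "AC9268RE3", "AA171E1Z0", "BU073C1ZW"],
    "Amount": [878, 11, 3998, 198, 5490, 3432],
})
df2 = pd.DataFrame({
    "Date": ["10/08/2020", "10/11/2020", "10/14/2020",
             "10/16/2020", "10/25/2020"],
    "BillNo.": ["ABBCSQ1ZA", "ATRC95REW", "AC9268RE3",
                "AA171E1Z0", "BPO66W2LO"],
    "Amount": [876, 115, 212, 5491, 344],
})

n = 5  # half-width of the allowed range [-n, +n]

# Inner merge on the key columns only; the two Amount columns get suffixes.
# validate="one_to_one" raises if a (Date, BillNo.) pair is duplicated.
merged = df1.merge(df2, on=["Date", "BillNo."],
                   suffixes=("_df1", "_df2"), validate="one_to_one")

# Keep only rows whose amounts differ by at most n in absolute value.
within = (merged["Amount_df1"] - merged["Amount_df2"]).abs() <= n
final = merged.loc[within].reset_index(drop=True)
print(final)
```

Here `final` keeps both amount columns, so the difference stays visible; drop whichever one is not needed.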

cs95 answered Sep 20 '22 20:09


You could use merge_asof:

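import pandas as pd

# merge_asof requires both frames to be sorted by the `on` key
udf2 = df2.drop_duplicates().sort_values('Amount')
res = pd.merge_asof(udf2, df1.sort_values('Amount').assign(indicator=1),
                    on='Amount', by=['Date', 'BillNo.'],
                    direction='nearest', tolerance=5)
# rows with no match within the tolerance have NaN in `indicator`
res = res.dropna().drop(columns='indicator')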

print(res)

Output

         Date    BillNo.  Amount
2  10/08/2020  ABBCSQ1ZA     876
3  10/16/2020  AA171E1Z0    5491
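
A hedged variant of the `merge_asof` approach that also keeps the matched amount from df1, so the difference can be inspected. The column names `Amount_df1` and `diff` are introduced here for illustration only. Note that `merge_asof` pairs each df2 row with at most one df1 row (the nearest one within the tolerance), which can differ from a plain merge when a key repeats.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["10/08/2020", "10/09/2020", "10/12/2020",
             "10/14/2020", "10/16/2020", "10/19/2020"],
    "BillNo.": ["ABBCSQ1ZA", "AADC9C1Z5", "AC928Q1ZS",
                "AC9268RE3", "AA171E1Z0", "BU073C1ZW"],
    "Amount": [878, 11, 3998, 198, 5490, 3432],
})
df2 = pd.DataFrame({
    "Date": ["10/08/2020", "10/11/2020", "10/14/2020",
             "10/16/2020", "10/25/2020"],
    "BillNo.": ["ABBCSQ1ZA", "ATRC95REW", "AC9268RE3",
                "AA171E1Z0", "BPO66W2LO"],
    "Amount": [876, 115, 212, 5491, 344],
})

# Both inputs must be sorted by the `on` column.
left = df2.sort_values("Amount")
# Copy df1's Amount into a second column so it survives the merge
# (the result keeps only the left frame's `on` column).
right = df1.assign(Amount_df1=df1["Amount"]).sort_values("Amount")

res = pd.merge_asof(left, right, on="Amount", by=["Date", "BillNo."],
                    direction="nearest", tolerance=5)

# Unmatched rows have NaN in Amount_df1; drop them.
res = res.dropna(subset=["Amount_df1"]).reset_index(drop=True)
res["diff"] = res["Amount"] - res["Amount_df1"]
print(res)
```

Because unmatched rows introduce NaN, `Amount_df1` and `diff` come back as floats.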

Dani Mesejo answered Sep 18 '22 20:09


We can set Date and BillNo. as the index, subtract one DataFrame from the other, and keep only the rows whose difference is between -5 and 5.

d1 = df1.set_index(['Date', 'BillNo.'])
d2 = df2.set_index(['Date', 'BillNo.'])

idx = (d1-d2).query('Amount>=-5 & Amount<=5').index

d1.loc[idx].reset_index()
         Date    BillNo.  Amount
0  10/08/2020  ABBCSQ1ZA     878
1  10/16/2020  AA171E1Z0    5490

d2.loc[idx].reset_index()
         Date    BillNo.  Amount
0  10/08/2020  ABBCSQ1ZA     876
1  10/16/2020  AA171E1Z0    5491

To make it more generic, so that it works with any n:

n = 5
idx = (d1-d2).query('Amount>=-@n & Amount<=@n').index

Or, with separate lower and upper limits:

lower_limit = -2 # Example, can be anything
upper_limit = 5  # Example, can be anything
idx = (d1-d2).query('Amount>=@lower_limit & Amount<=@upper_limit').index
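
The same idea can be sketched end to end without `query`, using `Series.between` on the aligned difference. This is an illustrative variant, not the only way to write it, and the names `diff`, `mask` and `final` are chosen here. It assumes each (Date, BillNo.) pair is unique within each frame, since the subtraction aligns rows on the index.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["10/08/2020", "10/09/2020", "10/12/2020",
             "10/14/2020", "10/16/2020", "10/19/2020"],
    "BillNo.": ["ABBCSQ1ZA", "AADC9C1Z5", "AC928Q1ZS",
                "AC9268RE3", "AA171E1Z0", "BU073C1ZW"],
    "Amount": [878, 11, 3998, 198, 5490, 3432],
})
df2 = pd.DataFrame({
    "Date": ["10/08/2020", "10/11/2020", "10/14/2020",
             "10/16/2020", "10/25/2020"],
    "BillNo.": ["ABBCSQ1ZA", "ATRC95REW", "AC9268RE3",
                "AA171E1Z0", "BPO66W2LO"],
    "Amount": [876, 115, 212, 5491, 344],
})

lower_limit, upper_limit = -5, 5

# Index both frames by the key columns and keep Amount as a Series.
d1 = df1.set_index(["Date", "BillNo."])["Amount"]
d2 = df2.set_index(["Date", "BillNo."])["Amount"]

# Subtraction aligns on the index; keys present in only one frame give NaN.
diff = d1 - d2

# between() is inclusive on both ends by default; NaN compares as False.
mask = diff.between(lower_limit, upper_limit)
idx = diff.index[mask]

# Pull the matching rows from df2 (use d1 instead for df1's amounts).
final = d2.reindex(idx).reset_index()
print(final)
```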

Ch3steR answered Sep 18 '22 20:09