Join two DataFrames on common columns only if the difference in a separate column is within range [-n, +n]

I have two data frames df1 and df2 as shown below:

df1

Date        BillNo.     Amount
10/08/2020  ABBCSQ1ZA   878
10/09/2020  AADC9C1Z5   11
10/12/2020  AC928Q1ZS   3998
10/14/2020  AC9268RE3   198
10/16/2020  AA171E1Z0   5490
10/19/2020  BU073C1ZW   3432

df2

Date        BillNo.     Amount
10/08/2020  ABBCSQ1ZA   876
10/11/2020  ATRC95REW   115
10/14/2020  AC9268RE3   212
10/16/2020  AA171E1Z0   5491
10/25/2020  BPO66W2LO   344

My final answer should be:

final

Date        BillNo.     Amount
10/08/2020  ABBCSQ1ZA   876
10/16/2020  AA171E1Z0   5491

How do I find the common rows of both data frames, matching on Date and BillNo., when the difference in Amount is within the range [-5, 5]?

I know how to find common rows by using:

df_all = df1.merge(df2.drop_duplicates(), on=['Date', 'BillNo.', 'Amount'], 
                   how='outer', indicator=True)

However, this only matches rows whose Amount is exactly equal; it doesn't return the rows that are within the range. Could anyone help?

Edit: For df1 (10/14/2020, AC9268RE3, 198) and df2 (10/14/2020, AC9268RE3, 212) the difference is 14, so this row should not be included in the common rows.

328

asked Dec 26 '20 12:12

Gopal Chitalia

3 Answers

We can merge, then perform a query to drop rows not within the range:

(df1.merge(df2, on=['Date', 'BillNo.'])
    .query('abs(Amount_x - Amount_y) <= 5')
    .drop('Amount_x', axis=1))

         Date    BillNo.  Amount_y
0  10/08/2020  ABBCSQ1ZA       876
1  10/16/2020  AA171E1Z0      5491

This works well as long as there is only one row that corresponds to a specific (Date, BillNo) combination in each frame.
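
For reference, here is a self-contained sketch of the same merge-then-filter idea, using the sample data from the question. The helper name `join_within` and its parameters are hypothetical, introduced only for illustration, and the filter uses a plain boolean mask instead of `query`. It assumes the Amount column is numeric.

```python
import pandas as pd

# Sample frames rebuilt from the question
df1 = pd.DataFrame({
    "Date": ["10/08/2020", "10/09/2020", "10/12/2020",
             "10/14/2020", "10/16/2020", "10/19/2020"],
    "BillNo.": ["ABBCSQ1ZA", "AADC9C1Z5", "AC928Q1ZS",
                "AC9268RE3", "AA171E1Z0", "BU073C1ZW"],
    "Amount": [878, 11, 3998, 198, 5490, 3432],
})
df2 = pd.DataFrame({
    "Date": ["10/08/2020", "10/11/2020", "10/14/2020",
             "10/16/2020", "10/25/2020"],
    "BillNo.": ["ABBCSQ1ZA", "ATRC95REW", "AC9268RE3",
                "AA171E1Z0", "BPO66W2LO"],
    "Amount": [876, 115, 212, 5491, 344],
})


def join_within(left, right, n, keys=("Date", "BillNo.")):
    """Hypothetical helper: inner-join on `keys`, then keep only the
    pairs whose Amount values differ by at most n."""
    merged = left.merge(right, on=list(keys), suffixes=("_x", "_y"))
    mask = (merged["Amount_x"] - merged["Amount_y"]).abs() <= n
    return (merged.loc[mask]
            .drop(columns="Amount_x")              # keep the right-hand Amount
            .rename(columns={"Amount_y": "Amount"})
            .reset_index(drop=True))


final = join_within(df1, df2, n=5)
print(final)
```

If a (Date, BillNo.) pair occurs several times in either frame, the merge produces every combination of those rows, and each combination is tested against the range separately.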

107

answered Sep 20 '22 20:09

cs95

You could use merge_asof:

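import pandas as pd

# merge_asof requires both frames to be sorted by the `on` key
udf2 = df2.drop_duplicates().sort_values('Amount')
res = pd.merge_asof(udf2, df1.sort_values('Amount').assign(indicator=1), on='Amount', by=['Date', 'BillNo.'],
                    direction='nearest', tolerance=5)
# rows without a match inside the tolerance have NaN in `indicator`
res = res.dropna().drop(columns='indicator')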

print(res)

Output

         Date    BillNo.  Amount
2  10/08/2020  ABBCSQ1ZA     876
3  10/16/2020  AA171E1Z0    5491
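
To see what `direction='nearest'` and `tolerance` do here, the following minimal sketch runs `merge_asof` on two rows taken from the question's data. The marker column `matched` is a made-up name for illustration. Within each (Date, BillNo.) group, each left row is paired with the right row whose Amount is closest, and the pairing is left empty when that closest value is too far away.

```python
import pandas as pd

# Two rows from df2 (left side), sorted by the `on` key
left = pd.DataFrame({
    "Date": ["10/08/2020", "10/14/2020"],
    "BillNo.": ["ABBCSQ1ZA", "AC9268RE3"],
    "Amount": [876, 212],
}).sort_values("Amount")

# The corresponding rows from df1 (right side), also sorted
right = pd.DataFrame({
    "Date": ["10/08/2020", "10/14/2020"],
    "BillNo.": ["ABBCSQ1ZA", "AC9268RE3"],
    "Amount": [878, 198],
}).sort_values("Amount")

res = pd.merge_asof(
    left,
    right.assign(matched=True),   # hypothetical marker column
    on="Amount",
    by=["Date", "BillNo."],
    direction="nearest",
    tolerance=5,
)

# `matched` is NaN where no right-hand Amount lies within the tolerance
print(res)
kept = res[res["matched"].notna()].drop(columns="matched")
print(kept)
```

In this sketch, 212 vs. 198 differs by 14 and is left unmatched, while 876 vs. 878 differs by 2 and is kept. Note that each left row is matched to at most one right row, so this approach selects the nearest candidate and does not enumerate all candidates within the range.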

answered Sep 18 '22 20:09

Dani Mesejo

We can set Date and BillNo. as the index of both DataFrames, subtract one from the other, and keep only the rows whose difference is between -5 and 5.

d1 = df1.set_index(['Date', 'BillNo.'])
d2 = df2.set_index(['Date', 'BillNo.'])

idx = (d1-d2).query('Amount>=-5 & Amount<=5').index

d1.loc[idx].reset_index()
         Date    BillNo.  Amount
0  10/08/2020  ABBCSQ1ZA     878
1  10/16/2020  AA171E1Z0    5490

d2.loc[idx].reset_index()
         Date    BillNo.  Amount
0  10/08/2020  ABBCSQ1ZA     876
1  10/16/2020  AA171E1Z0    5491

To make it generic, so that it works with any n or with separate lower and upper limits:

n = 5
idx = (d1-d2).query('Amount>=-@n & Amount<=@n').index

lower_limit = -2 # Example, can be anything
upper_limit = 5  # Example, can be anything
idx = (d1-d2).query('Amount>=@lower_limit & Amount<=@upper_limit').index
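
A self-contained sketch of this index-subtraction approach on the question's sample data follows. It uses `Series.between` in place of `query`, which is a substitution of mine and not part of the answer above, and it assumes each (Date, BillNo.) pair appears at most once per frame, since the subtraction aligns rows by index.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": ["10/08/2020", "10/09/2020", "10/12/2020",
             "10/14/2020", "10/16/2020", "10/19/2020"],
    "BillNo.": ["ABBCSQ1ZA", "AADC9C1Z5", "AC928Q1ZS",
                "AC9268RE3", "AA171E1Z0", "BU073C1ZW"],
    "Amount": [878, 11, 3998, 198, 5490, 3432],
})
df2 = pd.DataFrame({
    "Date": ["10/08/2020", "10/11/2020", "10/14/2020",
             "10/16/2020", "10/25/2020"],
    "BillNo.": ["ABBCSQ1ZA", "ATRC95REW", "AC9268RE3",
                "AA171E1Z0", "BPO66W2LO"],
    "Amount": [876, 115, 212, 5491, 344],
})

d1 = df1.set_index(["Date", "BillNo."])
d2 = df2.set_index(["Date", "BillNo."])

# Index-aligned subtraction: keys present in only one frame give NaN
diff = d1["Amount"] - d2["Amount"]

lower_limit, upper_limit = -5, 5   # example limits, inclusive
# NaN never falls inside the range, so unmatched keys drop out here
idx = diff[diff.between(lower_limit, upper_limit)].index

result = d2.loc[idx].reset_index()
print(result)
```

Selecting from `d2` returns the df2 amounts, as in the expected output of the question; selecting from `d1` with the same `idx` would return the df1 amounts instead.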

answered Sep 18 '22 20:09

Ch3steR



Tags:

python

pandas

dataframe