I have two data frames, df1 and df2, as shown below:
df1
Date BillNo. Amount
10/08/2020 ABBCSQ1ZA 878
10/09/2020 AADC9C1Z5 11
10/12/2020 AC928Q1ZS 3998
10/14/2020 AC9268RE3 198
10/16/2020 AA171E1Z0 5490
10/19/2020 BU073C1ZW 3432
df2
Date BillNo. Amount
10/08/2020 ABBCSQ1ZA 876
10/11/2020 ATRC95REW 115
10/14/2020 AC9268RE3 212
10/16/2020 AA171E1Z0 5491
10/25/2020 BPO66W2LO 344
My final answer should be:
final
Date BillNo. Amount
10/08/2020 ABBCSQ1ZA 876
10/16/2020 AA171E1Z0 5491
How do I find the common rows in both data frames using Date, BillNo., and Amount,
when the difference in Amount is within the range [-5, 5]?
I know how to find common rows by using:
df_all = df1.merge(df2.drop_duplicates(), on=['Date', 'BillNo.', 'Amount'],
how='outer', indicator=True)
However, this doesn't return the rows whose amounts fall within the range. Can anyone help?
Edit: For example, df1 has 10/14/2020, AC9268RE3, 198
and df2 has 10/14/2020, AC9268RE3, 212;
the difference is 14, so this row should not be included in the common rows.
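For reference, a minimal sketch that reconstructs the two frames above (assuming Date and BillNo. are plain strings and Amount is an integer):
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['10/08/2020', '10/09/2020', '10/12/2020', '10/14/2020', '10/16/2020', '10/19/2020'],
    'BillNo.': ['ABBCSQ1ZA', 'AADC9C1Z5', 'AC928Q1ZS', 'AC9268RE3', 'AA171E1Z0', 'BU073C1ZW'],
    'Amount': [878, 11, 3998, 198, 5490, 3432]})

df2 = pd.DataFrame({
    'Date': ['10/08/2020', '10/11/2020', '10/14/2020', '10/16/2020', '10/25/2020'],
    'BillNo.': ['ABBCSQ1ZA', 'ATRC95REW', 'AC9268RE3', 'AA171E1Z0', 'BPO66W2LO'],
    'Amount': [876, 115, 212, 5491, 344]})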
We can merge, then perform a query to drop rows not within the range:
(df1.merge(df2, on=['Date', 'BillNo.'])
.query('abs(Amount_x - Amount_y) <= 5')
.drop('Amount_x', axis=1))
Date BillNo. Amount_y
0 10/08/2020 ABBCSQ1ZA 876
1 10/16/2020 AA171E1Z0 5491
This works well as long as there is only one row for each (Date, BillNo.) combination in each frame.
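If you also want the output to match the expected frame exactly (a single Amount column holding df2's values), a minimal variant of the same chain with a rename appended:
final = (df1.merge(df2, on=['Date', 'BillNo.'])
            .query('abs(Amount_x - Amount_y) <= 5')
            .drop('Amount_x', axis=1)
            .rename(columns={'Amount_y': 'Amount'}))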
You could use merge_asof, matching on Amount within each (Date, BillNo.) group with a tolerance of 5. merge_asof keeps every row of the left frame, so a helper indicator column taken from df1 ends up NaN wherever no match is found within the tolerance; those rows can then be dropped:
udf2 = df2.drop_duplicates().sort_values('Amount')
res = pd.merge_asof(udf2, df1.sort_values('Amount').assign(indicator=1), on='Amount', by=['Date', 'BillNo.'],
direction='nearest', tolerance=5)
res = res.dropna().drop(columns='indicator')
print(res)
Output
Date BillNo. Amount
2 10/08/2020 ABBCSQ1ZA 876
3 10/16/2020 AA171E1Z0 5491
We can set Date and BillNo. as the index, subtract the two data frames, and keep only the rows where the Amount difference is between -5 and 5.
d1 = df1.set_index(['Date', 'BillNo.'])
d2 = df2.set_index(['Date', 'BillNo.'])
idx = (d1-d2).query('Amount>=-5 & Amount<=5').index
d1.loc[idx].reset_index()
Date BillNo. Amount
0 10/08/2020 ABBCSQ1ZA 878
1 10/16/2020 AA171E1Z0 5490
d2.loc[idx].reset_index()
Date BillNo. Amount
0 10/08/2020 ABBCSQ1ZA 876
1 10/16/2020 AA171E1Z0 5491
To make it more generic so it works with any n:
n = 5
idx = (d1-d2).query('Amount>=-@n & Amount<=@n').index
Or
lower_limit = -2 # Example, can be anything
upper_limit = 5 # Example, can be anything
idx = (d1-d2).query('Amount>=@lower_limit & Amount<=@upper_limit').index
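Putting it together, a minimal sketch of a reusable helper built on the same idea (the name common_rows and its parameters are only illustrative):
def common_rows(df1, df2, lower=-5, upper=5, keys=('Date', 'BillNo.')):
    # Align both frames on the key columns, then compare the Amount values.
    d1 = df1.set_index(list(keys))
    d2 = df2.set_index(list(keys))
    # Subtraction aligns on the index; keys present in only one frame
    # produce NaN and are dropped by the range filter below.
    diff = (d1 - d2)['Amount']
    idx = diff[(diff >= lower) & (diff <= upper)].index
    # Return df2's version of the matching rows, as in the expected output.
    return d2.loc[idx].reset_index()

common_rows(df1, df2)           # symmetric tolerance of +/- 5
common_rows(df1, df2, -2, 5)    # custom lower/upper limits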