I frequently use pandas to merge (join) dataframes on a range condition.
For instance if there are 2 dataframes:
A (A_id, A_value)
B (B_id,B_low, B_high, B_name)
which are big and approximately of the same size (let's say 2M records each).
I would like to make an inner join between A and B, so that A_value is between B_low and B_high.
Using SQL syntax that would be:
SELECT * FROM A,B WHERE A_value between B_low and B_high
and that would be really easy, short and efficient.
Meanwhile, the only way in pandas that I found (without loops) is to create a dummy column in both tables, join on it (equivalent to a cross join), and then filter out the unneeded rows. That sounds heavy and complex:
A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A, B, on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low, Temp.B_high)]
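(As a side note, on pandas 1.2 or newer the dummy column can be dropped in favor of an explicit cross join. This is just a sketch of the same idea, not a performance improvement, since the full len(A) * len(B) product is still materialized before filtering:

# Cross join (pandas >= 1.2), then keep rows where A_value falls in [B_low, B_high].
Temp = pd.merge(A, B, how='cross')
Result = Temp[Temp.A_value.between(Temp.B_low, Temp.B_high)]
)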
Another solution I had was to apply, for each value of A, a search function on B using the mask B[(x >= B.B_low) & (x <= B.B_high)], but that sounds inefficient as well and might require index optimization.
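For reference, a minimal sketch of that per-row approach might look like the following (the helper name match_b is mine, purely illustrative). It loops over A in Python, so it is roughly O(len(A) * len(B)) and would be very slow at 2M rows each:

# For each A_value, collect the B rows whose [B_low, B_high] range contains it.
def match_b(x):
    return B[(x >= B.B_low) & (x <= B.B_high)]

pieces = []
for _, row in A.iterrows():
    matches = match_b(row.A_value)
    if not matches.empty:
        pieces.append(matches.assign(A_id=row.A_id, A_value=row.A_value))
result = pd.concat(pieces, ignore_index=True)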
Is there a more elegant and/or efficient way to perform this action?
Setup
Consider the dataframes A and B:
import pandas as pd
import numpy as np

A = pd.DataFrame(dict(
    A_id=range(10),
    A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84]
))

A

   A_id  A_value
0     0        5
1     1       15
2     2       25
3     3       35
4     4       45
5     5       55
6     6       65
7     7       75
8     8       85
9     9       95

B

   B_high  B_id  B_low
0      10     0      0
1      40     1     30
2      50     2     30
3      54     3     46
4      84     4     84
numpy

The "easiest" way is to use numpy broadcasting. We look for every instance of A_value being greater than or equal to B_low while, at the same time, A_value is less than or equal to B_high.
a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)

   A_id  A_value  B_high  B_id  B_low
0     0        5      10     0      0
1     3       35      40     1     30
2     3       35      50     2     30
3     4       45      50     2     30
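One caveat for the 2M-row case: the broadcast comparison materializes a len(A) x len(B) boolean matrix, which can exhaust memory. A rough workaround (my own sketch, not part of the original answer) is to process A in chunks so the intermediate matrix stays bounded:

# Same broadcasting idea, but compare only chunk_size rows of A at a time.
chunk_size = 100_000  # tune to available memory
pieces = []
for start in range(0, len(a), chunk_size):
    chunk = a[start:start + chunk_size]
    ci, cj = np.where((chunk[:, None] >= bl) & (chunk[:, None] <= bh))
    pieces.append(pd.concat([
        A.iloc[start + ci].reset_index(drop=True),
        B.iloc[cj].reset_index(drop=True)
    ], axis=1))
result = pd.concat(pieces, ignore_index=True)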
To address the comments and give something akin to a left join, I appended the part of A that doesn't match.
pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1).append(
    A[~np.in1d(np.arange(len(A)), np.unique(i))],
    ignore_index=True, sort=False
)

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      3       35   1.0   30.0    40.0
2      3       35   2.0   30.0    50.0
3      4       45   2.0   30.0    50.0
4      1       15   NaN    NaN     NaN
5      2       25   NaN    NaN     NaN
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN
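Note that DataFrame.append was removed in pandas 2.0; on newer versions, the same left-join-style result can be built with pd.concat along axis 0 (a small adaptation of the snippet above, not part of the original answer):

# Equivalent of the .append(...) call for pandas >= 2.0.
inner = pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)
unmatched = A[~np.in1d(np.arange(len(A)), np.unique(i))]
left_join = pd.concat([inner, unmatched], ignore_index=True, sort=False)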