Best way to join / merge by range in pandas

I frequently use pandas to merge (join) on a range condition.

For instance if there are 2 dataframes:

A (A_id, A_value)

B (B_id, B_low, B_high, B_name)

which are big and approximately of the same size (let's say 2M records each).

I would like to make an inner join between A and B, so A_value would be between B_low and B_high.

Using SQL syntax that would be:

SELECT * FROM A,B WHERE A_value between B_low and B_high 

and that would be really easy, short and efficient.

Meanwhile, in pandas the only loop-free way I have found is to create a dummy column in both tables, join on it (equivalent to a cross join), and then filter out the unneeded rows. That sounds heavy and complex:

    A['dummy'] = 1
    B['dummy'] = 1
    Temp = pd.merge(A, B, on='dummy')
    Result = Temp[Temp.A_value.between(Temp.B_low, Temp.B_high)]
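On pandas 1.2 or later, the same cross join can be sketched without the dummy column by passing how="cross" to merge. The frames below are made-up toy data rather than the real tables, and the intermediate frame still holds len(A) * len(B) rows, so this tidies the syntax without reducing the cost.

```python
import pandas as pd

# Toy frames; the values are invented for this sketch.
A = pd.DataFrame({"A_id": [0, 1, 2], "A_value": [5, 35, 95]})
B = pd.DataFrame({"B_id": [0, 1], "B_low": [0, 30], "B_high": [10, 40]})

# how="cross" (pandas >= 1.2) pairs every row of A with every row of B,
# so neither frame needs a helper column.
temp = A.merge(B, how="cross")

# Keep only the pairs where A_value falls inside [B_low, B_high].
result = temp[temp.A_value.between(temp.B_low, temp.B_high)].reset_index(drop=True)
print(result)
```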

Another solution I considered is to apply, for each value x in A, a search on B using the mask B[(x >= B.B_low) & (x <= B.B_high)], but that sounds inefficient as well and might require index optimization.

Is there a more elegant and/or efficient way to perform this action?

Dimgold asked Jun 05 '17 11:06



1 Answer

Setup
Consider the dataframes A and B

    import numpy as np
    import pandas as pd

    A = pd.DataFrame(dict(
            A_id=range(10),
            A_value=range(5, 105, 10)
        ))
    B = pd.DataFrame(dict(
            B_id=range(5),
            B_low=[0, 30, 30, 46, 84],
            B_high=[10, 40, 50, 54, 84]
        ))

    A

       A_id  A_value
    0     0        5
    1     1       15
    2     2       25
    3     3       35
    4     4       45
    5     5       55
    6     6       65
    7     7       75
    8     8       85
    9     9       95

    B

       B_high  B_id  B_low
    0      10     0      0
    1      40     1     30
    2      50     2     30
    3      54     3     46
    4      84     4     84

numpy
The ✌easiest✌ way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.

    a = A.A_value.values
    bh = B.B_high.values
    bl = B.B_low.values

    i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

    pd.concat([
        A.loc[i, :].reset_index(drop=True),
        B.loc[j, :].reset_index(drop=True)
    ], axis=1)

       A_id  A_value  B_high  B_id  B_low
    0     0        5      10     0      0
    1     3       35      40     1     30
    2     3       35      50     2     30
    3     4       45      50     2     30
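One caveat with broadcasting is that the boolean mask has len(A) * len(B) entries, which for two frames of 2M rows each is 4 × 10^12 booleans. A possible workaround, sketched below, is to compare A against B in slices so that only one slice of the mask exists at a time. The function name range_join_chunked and the chunk_size parameter are hypothetical names chosen for this sketch, and it assumes A has at least one row.

```python
import numpy as np
import pandas as pd

def range_join_chunked(A, B, chunk_size=1_000):
    """Inner range join, comparing A against B chunk_size rows at a time.

    Hypothetical helper: each mask holds chunk_size * len(B) booleans,
    so pick chunk_size to fit the available memory. Assumes A is non-empty.
    """
    bl = B.B_low.values
    bh = B.B_high.values
    a_all = A.A_value.values
    pieces = []
    for start in range(0, len(A), chunk_size):
        a = a_all[start:start + chunk_size]
        # Mask of shape (len(a), len(B)) for this slice only.
        i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
        pieces.append(pd.concat([
            A.iloc[start + i].reset_index(drop=True),  # shift back to positions in A
            B.iloc[j].reset_index(drop=True),
        ], axis=1))
    return pd.concat(pieces, ignore_index=True)

# Same frames as in the setup, with a deliberately tiny chunk size.
A = pd.DataFrame(dict(A_id=range(10), A_value=range(5, 105, 10)))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84],
))
result = range_join_chunked(A, B, chunk_size=4)
print(result)
```

The total work is unchanged (every pair is still compared); only the peak memory is bounded by the chunk size.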

To address the comments and give something akin to a left join, I appended the part of A that doesn't match.

    # DataFrame.append was removed in pandas 2.0, so the unmatched rows
    # are added with a second pd.concat instead.
    matched = pd.concat([
        A.loc[i, :].reset_index(drop=True),
        B.loc[j, :].reset_index(drop=True)
    ], axis=1)
    unmatched = A[~np.isin(np.arange(len(A)), np.unique(i))]
    pd.concat([matched, unmatched], ignore_index=True, sort=False)

        A_id  A_value  B_id  B_low  B_high
    0      0        5   0.0    0.0    10.0
    1      3       35   1.0   30.0    40.0
    2      3       35   2.0   30.0    50.0
    3      4       45   2.0   30.0    50.0
    4      1       15   NaN    NaN     NaN
    5      2       25   NaN    NaN     NaN
    6      5       55   NaN    NaN     NaN
    7      6       65   NaN    NaN     NaN
    8      7       75   NaN    NaN     NaN
    9      8       85   NaN    NaN     NaN
    10     9       95   NaN    NaN     NaN
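If the unmatched rows should stay in A's original order rather than being stacked at the end, one alternative is to let merge do the left join on A's row position. This is a sketch of a different technique from the one above, not a replacement for it; the helper column A_pos is a hypothetical name introduced only for this example.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(dict(A_id=range(10), A_value=range(5, 105, 10)))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84],
))

a = A.A_value.values
i, j = np.where((a[:, None] >= B.B_low.values) & (a[:, None] <= B.B_high.values))

# Matching rows of B, tagged with the position of the A row they matched.
# "A_pos" is a temporary join key invented for this sketch.
matched = B.iloc[j].reset_index(drop=True).assign(A_pos=i)

# A left merge keeps every row of A; rows without a match get NaN in B's columns.
left = (
    A.assign(A_pos=np.arange(len(A)))
     .merge(matched, on="A_pos", how="left")
     .drop(columns="A_pos")
)
print(left)
```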
piRSquared answered Oct 08 '22 18:10