Best way to join / merge by range in pandas


Tags: python, join, pandas, numpy

I frequently use pandas to merge (join) on a range condition.

For instance, suppose there are two DataFrames:

A (A_id, A_value)

B (B_id, B_low, B_high, B_name)

Both are large and roughly the same size (say, 2M records each).

I would like to inner join A and B so that A_value lies between B_low and B_high.

Using SQL syntax that would be:

SELECT * FROM A,B WHERE A_value between B_low and B_high

and that would be really easy, short and efficient.

Meanwhile, the only way I have found in pandas that avoids loops is to create a dummy column in both tables, join on it (equivalent to a cross join), and then filter out the unneeded rows. That sounds heavy and complex:

    A['dummy'] = 1
    B['dummy'] = 1
    Temp = pd.merge(A, B, on='dummy')
    Result = Temp[Temp.A_value.between(Temp.B_low, Temp.B_high)]
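As a rough sketch of the same cross-join-then-filter idea without the helper column: pandas 1.2 and later accept how="cross" in merge. The small frames below are made-up values for illustration only, not the real 2M-row tables, and the intermediate frame still holds len(A) * len(B) rows.

```python
import pandas as pd

# Made-up sample data for illustration.
A = pd.DataFrame({"A_id": [0, 1, 2], "A_value": [5, 35, 60]})
B = pd.DataFrame({
    "B_id": [0, 1],
    "B_low": [0, 30],
    "B_high": [10, 40],
    "B_name": ["first", "second"],
})

# how="cross" pairs every row of A with every row of B,
# so no dummy key column is needed.
pairs = A.merge(B, how="cross")

# Keep only the pairs whose A_value falls inside [B_low, B_high].
in_range = pairs["A_value"].between(pairs["B_low"], pairs["B_high"])
result = pairs[in_range].reset_index(drop=True)
print(result)
```

This only tidies the syntax; the memory cost of the full cross product is unchanged.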

Another solution I considered is to run a search on B for each value x of A_value, using the mask B[(x >= B.B_low) & (x <= B.B_high)], but that also sounds inefficient and might require index optimization.
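For concreteness, the per-value search described above might look like the sketch below. The sample values are made up for illustration. Each iteration scans all of B, so the work grows with len(A) * len(B) and the outer loop runs in Python.

```python
import pandas as pd

# Made-up sample data for illustration.
A = pd.DataFrame({"A_id": [0, 1, 2, 3], "A_value": [5, 35, 45, 60]})
B = pd.DataFrame({
    "B_id": [0, 1, 2],
    "B_low": [0, 30, 30],
    "B_high": [10, 40, 50],
})

pieces = []
for a_id, x in zip(A["A_id"], A["A_value"]):
    # One full scan of B for every row of A.
    hits = B[(x >= B["B_low"]) & (x <= B["B_high"])]
    if not hits.empty:
        # Attach the A columns to the matching B rows.
        pieces.append(hits.assign(A_id=a_id, A_value=x))

# pd.concat needs at least one piece; the sample data guarantees that here.
result = pd.concat(pieces, ignore_index=True)[
    ["A_id", "A_value", "B_id", "B_low", "B_high"]
]
print(result)
```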

Is there a more elegant and/or efficient way to perform this action?

359

asked Jun 05 '17 11:06

Dimgold

1 Answer

Setup
Consider the dataframes A and B

    import numpy as np
    import pandas as pd

    A = pd.DataFrame(dict(
        A_id=range(10),
        A_value=range(5, 105, 10)
    ))
    B = pd.DataFrame(dict(
        B_id=range(5),
        B_low=[0, 30, 30, 46, 84],
        B_high=[10, 40, 50, 54, 84]
    ))

    A

       A_id  A_value
    0     0        5
    1     1       15
    2     2       25
    3     3       35
    4     4       45
    5     5       55
    6     6       65
    7     7       75
    8     8       85
    9     9       95

    B

       B_id  B_low  B_high
    0     0      0      10
    1     1     30      40
    2     2     30      50
    3     3     46      54
    4     4     84      84

numpy
The "easiest" way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.

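    a = A.A_value.values
    bh = B.B_high.values
    bl = B.B_low.values

    # Compare every A_value (as a column) against every interval (as a row);
    # i and j are the positions of the matching rows in A and B.
    i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

    # np.where returns positions, so select with iloc rather than loc.
    pd.concat([
        A.iloc[i].reset_index(drop=True),
        B.iloc[j].reset_index(drop=True)
    ], axis=1)

       A_id  A_value  B_id  B_low  B_high
    0     0        5     0      0      10
    1     3       35     1     30      40
    2     3       35     2     30      50
    3     4       45     2     30      50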
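One caveat at the scale mentioned in the question: the broadcast comparison builds a boolean array with len(A) * len(B) entries, which for 2M x 2M rows is about 4 x 10^12 one-byte values, far more than typical memory. A possible workaround, sketched below, is to broadcast one slice of A at a time. The helper name range_join_chunked and its chunk_size parameter are hypothetical, chosen for illustration; this is not part of pandas or numpy.

```python
import numpy as np
import pandas as pd


def range_join_chunked(A, B, chunk_size=100_000):
    """Inner range join of A and B, broadcasting one slice of A at a time.

    Hypothetical helper: the name and chunk_size are illustrative only.
    """
    a = A["A_value"].to_numpy()
    bl = B["B_low"].to_numpy()
    bh = B["B_high"].to_numpy()

    # Start with empty arrays so np.concatenate also works when A is empty.
    left_pos = [np.empty(0, dtype=np.intp)]
    right_pos = [np.empty(0, dtype=np.intp)]

    for start in range(0, len(a), chunk_size):
        block = a[start:start + chunk_size, None]   # shape (chunk, 1)
        i, j = np.where((block >= bl) & (block <= bh))
        left_pos.append(i + start)                  # shift back to positions in A
        right_pos.append(j)

    i = np.concatenate(left_pos)
    j = np.concatenate(right_pos)
    return pd.concat([
        A.iloc[i].reset_index(drop=True),
        B.iloc[j].reset_index(drop=True),
    ], axis=1)


A = pd.DataFrame({"A_id": range(10), "A_value": range(5, 105, 10)})
B = pd.DataFrame({
    "B_id": range(5),
    "B_low": [0, 30, 30, 46, 84],
    "B_high": [10, 40, 50, 54, 84],
})

# A deliberately tiny chunk size, just to exercise the chunking logic.
out = range_join_chunked(A, B, chunk_size=3)
print(out)
```

Peak memory for the mask is then roughly chunk_size * len(B) bytes per comparison, while the total number of comparisons stays the same.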

To address the comments and give something akin to a left join, I appended the part of A that doesn't match.

    matched = pd.concat([
        A.iloc[i].reset_index(drop=True),
        B.iloc[j].reset_index(drop=True)
    ], axis=1)

    # Rows of A whose position never appears in i have no match in B.
    unmatched = A[~np.isin(np.arange(len(A)), np.unique(i))]

    # Stack the unmatched rows under the matches; their B columns become NaN.
    pd.concat([matched, unmatched], ignore_index=True, sort=False)

        A_id  A_value  B_id  B_low  B_high
    0      0        5   0.0    0.0    10.0
    1      3       35   1.0   30.0    40.0
    2      3       35   2.0   30.0    50.0
    3      4       45   2.0   30.0    50.0
    4      1       15   NaN    NaN     NaN
    5      2       25   NaN    NaN     NaN
    6      5       55   NaN    NaN     NaN
    7      6       65   NaN    NaN     NaN
    8      7       75   NaN    NaN     NaN
    9      8       85   NaN    NaN     NaN
    10     9       95   NaN    NaN     NaN
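If the unmatched rows should stay in A's original order rather than being stacked at the end, one option is to merge the inner result back onto A with how="left". This is only a sketch, and it assumes that each (A_id, A_value) pair identifies a single row of A; with duplicated rows in A the matches would be repeated.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({"A_id": range(10), "A_value": range(5, 105, 10)})
B = pd.DataFrame({
    "B_id": range(5),
    "B_low": [0, 30, 30, 46, 84],
    "B_high": [10, 40, 50, 54, 84],
})

a = A["A_value"].to_numpy()
i, j = np.where((a[:, None] >= B["B_low"].to_numpy()) &
                (a[:, None] <= B["B_high"].to_numpy()))

# Inner range join, as in the broadcasting approach above.
inner = pd.concat([
    A.iloc[i].reset_index(drop=True),
    B.iloc[j].reset_index(drop=True),
], axis=1)

# Left-merge onto A: every A row is kept, in A's order, and rows
# without a matching interval get NaN in the B columns.
left = A.merge(inner, on=["A_id", "A_value"], how="left")
print(left)
```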

170

answered Oct 08 '22 18:10

piRSquared
