
Does pandas dataframe merge work with greater or less?

I need to merge two DataFrames on a greater-than-or-equal (>=) condition, but merge only supports equality. Is there any way to deal with this? Thanks!

J.Bao asked Mar 28 '17 01:03



1 Answer

I don't know of a way to express the following query directly with the merge or join syntax in pandas:

SELECT * 
FROM a 
INNER JOIN b 
ON a.column1 >= b.column1 AND a.column1 <= b.column2 

However, the query above can also be written as an implicit join:

SELECT * 
FROM a, b 
WHERE a.column1 >= b.column1 AND a.column1 <= b.column2 

This is the older syntax, and it should behave exactly the same, performance included. It takes the Cartesian product of the two tables (a cross join) and then selects from it using the WHERE condition, which is easy to implement in pandas. This can be heavy on memory, because the intermediate frame has len(a) * len(b) rows, but it should be fast.

First, the FROM a, b clause (we temporarily assign a column with the same value in every row, so that we can cross join on it):

df = pd.merge(a.assign(key=0), b.assign(key=0), on='key').drop('key', axis=1)

and then use boolean indexing (our WHERE clause) to filter the frame. Note that merge appends the suffixes _x and _y to column names that appear in both frames:

df[(df["column1_x"] >= df["column1_y"]) & (df["column1_x"] <= df["column2_y"])]
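Put together, the two steps above can be sketched end to end as follows. The sample frames are hypothetical and only assume that both a and b have columns named column1 and column2, which is what the _x and _y suffixes in the filter imply.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the answer.
a = pd.DataFrame({"column1": [1, 5, 10], "column2": [0, 0, 0]})
b = pd.DataFrame({"column1": [0, 4, 20], "column2": [3, 6, 30]})

# FROM a, b: pair every row of a with every row of b via a constant helper key.
cross = pd.merge(a.assign(key=0), b.assign(key=0), on="key").drop("key", axis=1)

# WHERE a.column1 >= b.column1 AND a.column1 <= b.column2
mask = (cross["column1_x"] >= cross["column1_y"]) & (
    cross["column1_x"] <= cross["column2_y"]
)
result = cross[mask].reset_index(drop=True)
print(result)
```

With this data, only the pairs where a.column1 is 1 (inside the range 0 to 3) and 5 (inside the range 4 to 6) survive the filter. On pandas 1.2 or later, a.merge(b, how="cross") should produce the same Cartesian product without the helper key column.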

If you don't want the Cartesian product and only want to compare rows that share the same index in both tables, you can merge on the index like this:

df = a.merge(b, left_index=True, right_index=True)

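or concat along axis 1 if they share the same index. Unlike merge, concat does not rename overlapping columns, so add the suffixes yourself:

df = pd.concat([a.add_suffix("_x"), b.add_suffix("_y")], axis=1)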

Then use boolean indexing again to filter the rows:

df[(df["column1_x"] >= df["column1_y"]) & (df["column1_x"] <= df["column2_y"])]
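As a rough illustration of the index-aligned variant, the sketch below compares each row of a only with the row of b at the same index. The sample data is hypothetical, and both frames are assumed to share a default integer index and the column names column1 and column2.

```python
import pandas as pd

# Hypothetical sample data with matching default indexes (0, 1, 2).
a = pd.DataFrame({"column1": [1, 5, 10], "column2": [0, 0, 0]})
b = pd.DataFrame({"column1": [0, 6, 9], "column2": [3, 8, 12]})

# Row i of a is paired only with row i of b; overlapping names get _x / _y.
paired = a.merge(b, left_index=True, right_index=True)

# Equivalent layout with concat, which needs the suffixes added by hand.
side_by_side = pd.concat([a.add_suffix("_x"), b.add_suffix("_y")], axis=1)

# Keep rows where a.column1 falls inside [b.column1, b.column2].
mask = (paired["column1_x"] >= paired["column1_y"]) & (
    paired["column1_x"] <= paired["column2_y"]
)
matched = paired[mask]
print(matched)
```

Here rows 0 and 2 pass the filter (1 lies in 0 to 3, and 10 lies in 9 to 12), while row 1 is dropped because 5 is below 6. No intermediate Cartesian product is built, so memory use stays proportional to the input size.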
umutto answered Sep 24 '22 10:09