Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter Pandas dataframe based on combination of two columns

Tags:

python

pandas

Suppose I have a dataframe as,

   a  b
0  1  2
1  2  3
2  4  2
3  4  3

I want to filter the dataframe such that I get the result as,

   a  b
0  1  2
3  4  3

i.e, I want the combination (1,2) and (4,3) by filtering the two columns together.

If I try this,

df1 = df[df['a'].isin([1,4]) & df['b'].isin([2,3])]

I get the entire dataframe back because combinations of (1,3) and (4,2) also gets included in the above method. But I need only the given combinations. I have a huge list of tuples of two columns based on which I want to filter the dataframe considering the corresponding tuple combination.

Also, I dont want to merge the two columns together as a single string and then filter.

like image 765
mayank agrawal Avatar asked Dec 27 '18 13:12

mayank agrawal


People also ask

How do I get two column combinations in pandas?

By use + operator simply you can combine/merge two or multiple text/string columns in pandas DataFrame. Note that when you apply + operator on numeric columns it actually does addition instead of concatenation.

Can you sort by two columns in pandas?

You can sort pandas DataFrame by one or multiple (one or more) columns using sort_values() method and by ascending or descending order.

How do I filter a DataFrame based on column values in R?

Column values can be subjected to constraints to filter and subset the data. The conditions can be combined by logical & or | operators. The %in% operator is used here, in order to check values that match to any of the values within a specified vector.

Can we group by two columns in DataFrame?

Grouping by Multiple ColumnsYou can do this by passing a list of column names to groupby instead of a single string value.


2 Answers

Use -

df[df[['a', 'b']].apply(tuple, axis=1).isin([(1,2), (4,3)])]

Output

    a   b
0   1   2
3   4   3

Explanation

df[['a', 'b']].apply(tuple, axis=1) gives a series of tuples -

0    (1, 2)
1    (2, 3)
2    (4, 2)
3    (4, 3)

.isin([(1,2), (4,3)]) searches for the desired tuples and gives a boolean series

like image 67
Vivek Kalyanarangan Avatar answered Sep 18 '22 07:09

Vivek Kalyanarangan


The tuple comparison approach as outlined by @Vivek Kalyanarangan is the way to go but the speed can be significantly increased in case of large dataframes by utilizing the MultiIndex instead of using an apply function for tuple creation:

For example, in your case:

keep_tuples = [(1,2), (4,3)]
tuples_in_df = pd.MultiIndex.from_frame(df[["a","b"]])
df[tuples_in_df.isin(keep_tuples)]

This leads to ~5X speed improvement on a 1,000,000 X 2 sized df when compared to using apply function.

like image 39
Md Imbesat Hassan Rizvi Avatar answered Sep 22 '22 07:09

Md Imbesat Hassan Rizvi