Suppose I have a dataframe as,
a b
0 1 2
1 2 3
2 4 2
3 4 3
I want to filter the dataframe such that I get the result as,
a b
0 1 2
3 4 3
i.e, I want the combination (1,2)
and (4,3)
by filtering the two columns together.
If I try this,
df1 = df[df['a'].isin([1,4]) & df['b'].isin([2,3])]
I get the entire dataframe back because combinations of (1,3)
and (4,2)
also gets included in the above method. But I need only the given combinations. I have a huge list of tuples of two columns based on which I want to filter the dataframe considering the corresponding tuple combination.
Also, I dont want to merge the two columns together as a single string and then filter.
By use + operator simply you can combine/merge two or multiple text/string columns in pandas DataFrame. Note that when you apply + operator on numeric columns it actually does addition instead of concatenation.
You can sort pandas DataFrame by one or multiple (one or more) columns using sort_values() method and by ascending or descending order.
Column values can be subjected to constraints to filter and subset the data. The conditions can be combined by logical & or | operators. The %in% operator is used here, in order to check values that match to any of the values within a specified vector.
Grouping by Multiple ColumnsYou can do this by passing a list of column names to groupby instead of a single string value.
Use -
df[df[['a', 'b']].apply(tuple, axis=1).isin([(1,2), (4,3)])]
Output
a b
0 1 2
3 4 3
Explanation
df[['a', 'b']].apply(tuple, axis=1)
gives a series of tuples -
0 (1, 2)
1 (2, 3)
2 (4, 2)
3 (4, 3)
.isin([(1,2), (4,3)])
searches for the desired tuples and gives a boolean series
The tuple comparison approach as outlined by @Vivek Kalyanarangan is the way to go but the speed can be significantly increased in case of large dataframes by utilizing the MultiIndex instead of using an apply function for tuple creation:
For example, in your case:
keep_tuples = [(1,2), (4,3)]
tuples_in_df = pd.MultiIndex.from_frame(df[["a","b"]])
df[tuples_in_df.isin(keep_tuples)]
This leads to ~5X speed improvement on a 1,000,000 X 2 sized df when compared to using apply function.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With