how do you filter pandas dataframes by multiple columns

People also ask

How do you sort data frames by multiple columns?

You can sort pandas DataFrame by one or multiple (one or more) columns using sort_values() method and by ascending or descending order. To specify the order, you have to use ascending boolean property; False for descending and True for ascending.

How do you filter a DataFrame in multiple conditions?

Using Loc to Filter With Multiple Conditions The loc function in pandas can be used to access groups of rows or columns by label. Add each condition you want to be included in the filtered result and concatenate them with the & operator. You'll see our code sample will return a pd. dataframe of our filtered rows.

Can you Groupby multiple columns in pandas?

Pandas comes with a whole host of sql-like aggregation functions you can apply when grouping on one or more columns. This is Python's closest equivalent to dplyr's group_by + summarise logic.

Using & operator, don't forget to wrap the sub-statements with ():

males = df[(df[Gender]=='Male') & (df[Year]==2014)]

To store your dataframes in a dict using a for loop:

from collections import defaultdict
dic={}
for g in ['male', 'female']:
  dic[g]=defaultdict(dict)
  for y in [2013, 2014]:
    dic[g][y]=df[(df[Gender]==g) & (df[Year]==y)] #store the DataFrames to a dict of dict

EDIT:

A demo for your getDF:

def getDF(dic, gender, year):
  return dic[gender][year]

print genDF(dic, 'male', 2014)

For more general boolean functions that you would like to use as a filter and that depend on more than one column, you can use:

df = df[df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)]

where f is a function that is applied to every pair of elements (x1, x2) from col_1 and col_2 and returns True or False depending on any condition you want on (x1, x2).

Start from pandas 0.13, this is the most efficient way.

df.query('Gender=="Male" & Year=="2014" ')

In case somebody wonders what is the faster way to filter (the accepted answer or the one from @redreamality):

import pandas as pd
import numpy as np

length = 100_000
df = pd.DataFrame()
df['Year'] = np.random.randint(1950, 2019, size=length)
df['Gender'] = np.random.choice(['Male', 'Female'], length)

%timeit df.query('Gender=="Male" & Year=="2014" ')
%timeit df[(df['Gender']=='Male') & (df['Year']==2014)]

Results for 100,000 rows:

6.67 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.54 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Results for 10,000,000 rows:

326 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
472 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So results depend on the size and the data. On my laptop, query() gets faster after 500k rows. Further, the string search in Year=="2014" has an unnecessary overhead (Year==2014 is faster).

You can create your own filter function using query in pandas. Here you have filtering of df results by all the kwargs parameters. Dont' forgot to add some validators(kwargs filtering) to get filter function for your own df.

def filter(df, **kwargs):
    query_list = []
    for key in kwargs.keys():
        query_list.append(f'{key}=="{kwargs[key]}"')
    query = ' & '.join(query_list)
    return df.query(query)

Related questions
                            
                                Installing Python 3 on RHEL
                            
                                How to update Python?
                            
                                Best way to create a simple python web service [closed]
                            
                                Why does the expression 0 < 0 == 0 return False in Python?
                            
                                Suppress/ print without b' prefix for bytes in Python 3
                            
                                How can I profile Python code line-by-line?
                            
                                What is a None value?
                            
                                Complex numbers in python
                            
                                What is the difference between class and instance attributes?
                            
                                Iterating Over Dictionary Key Values Corresponding to List in Python
                            
                                How should I read a file line-by-line in Python?
                            
                                How to extract the year from a Python datetime object?
                            
                                How to "select distinct" across multiple data frame columns in pandas?
                            
                                Very Long If Statement in Python [duplicate]
                            
                                How to set class attribute with await in __init__
                            
                                unbound method f() must be called with fibo_ instance as first argument (got classobj instance instead)
                            
                                Binning a column with Python Pandas
                            
                                Given a URL to a text file, what is the simplest way to read the contents of the text file?
                            
                                Split string using a newline delimiter with Python
                            
                                [] and {} vs list() and dict(), which is better?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

how do you filter pandas dataframes by multiple columns

Tags:

python

pandas

filter

People also ask

EDIT:

Recent Activity

Donate For Us