Efficient way to apply multiple filters to pandas DataFrame or Series

People also ask

What is the most efficient way to loop through Dataframes with pandas?

Vectorization is always the first and best choice. You can convert the data frame to NumPy array or into dictionary format to speed up the iteration workflow. Iterating through the key-value pair of dictionaries comes out to be the fastest way with around 280x times speed up for 20 million records.

Can DataFrame contain multiple series?

You can create a DataFrame from multiple Series objects by adding each series as a columns. By using concat() method you can merge multiple series together into DataFrame.

Pandas (and numpy) allow for boolean indexing, which will be much more efficient:

In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]: 
1    1
2    2
Name: col1

In [12]: df[df['col1'] >= 1]
Out[12]: 
   col1  col2
1     1    11
2     2    12

In [13]: df[(df['col1'] >= 1) & (df['col1'] <=1 )]
Out[13]: 
   col1  col2
1     1    11

If you want to write helper functions for this, consider something along these lines:

In [14]: def b(x, col, op, n): 
             return op(x[col],n)

In [15]: def f(x, *b):
             return x[(np.logical_and(*b))]

In [16]: b1 = b(df, 'col1', ge, 1)

In [17]: b2 = b(df, 'col1', le, 1)

In [18]: f(df, b1, b2)
Out[18]: 
   col1  col2
1     1    11

Update: pandas 0.13 has a query method for these kind of use cases, assuming column names are valid identifiers the following works (and can be more efficient for large frames as it uses numexpr behind the scenes):

In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
   col1  col2
1     1    11

Chaining conditions creates long lines, which are discouraged by pep8. Using the .query method forces to use strings, which is powerful but unpythonic and not very dynamic.

Once each of the filters is in place, one approach is

import numpy as np
import functools
def conjunction(*conditions):
    return functools.reduce(np.logical_and, conditions)

c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4

data_filtered = data[conjunction(c1,c2,c3)]

np.logical operates on and is fast, but does not take more than two arguments, which is handled by functools.reduce.

Note that this still has some redundancies: a) shortcutting does not happen on a global level b) Each of the individual conditions runs on the whole initial data. Still, I expect this to be efficient enough for many applications and it is very readable.

You can also make a disjunction (wherein only one of the conditions needs to be true) by using np.logical_or instead:

import numpy as np
import functools
def disjunction(*conditions):
    return functools.reduce(np.logical_or, conditions)

c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4

data_filtered = data[disjunction(c_1,c_2,c_3)]

Simplest of All Solutions:

Use:

filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]

Another Example, To filter the dataframe for values belonging to Feb-2018, use the below code

filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]

Since pandas 0.22 update, comparison options are available like:

gt (greater than)
lt (less than)
eq (equals to)
ne (not equals to)
ge (greater than or equals to)

and many more. These functions return boolean array. Let's see how we can use them:

# sample data
df = pd.DataFrame({'col1': [0, 1, 2,3,4,5], 'col2': [10, 11, 12,13,14,15]})

# get values from col1 greater than or equals to 1
df.loc[df['col1'].ge(1),'col1']

1    1
2    2
3    3
4    4
5    5

# where co11 values is between 0 and 2
df.loc[df['col1'].between(0,2)]

 col1 col2
0   0   10
1   1   11
2   2   12

# where col1 > 1
df.loc[df['col1'].gt(1)]

 col1 col2
2   2   12
3   3   13
4   4   14
5   5   15

Why not do this?

def filt_spec(df, col, val, op):
    import operator
    ops = {'eq': operator.eq, 'neq': operator.ne, 'gt': operator.gt, 'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
    return df[ops[op](df[col], val)]
pandas.DataFrame.filt_spec = filt_spec

Demo:

df = pd.DataFrame({'a': [1,2,3,4,5], 'b':[5,4,3,2,1]})
df.filt_spec('a', 2, 'ge')

Result:

You can see that column 'a' has been filtered where a >=2.

This is slightly faster (typing time, not performance) than operator chaining. You could of course put the import at the top of the file.

e can also select rows based on values of a column that are not in a list or any iterable. We will create boolean variable just like before, but now we will negate the boolean variable by placing ~ in the front.

For example

list = [1, 0]
df[df.col1.isin(list)]

If you want to check any/all of multiple columns for a value, you can do:

df[(df[['HomeTeam', 'AwayTeam']] == 'Fulham').any(axis=1)]

Related questions
                            
                                Trying to mock datetime.date.today(), but not working
                            
                                Django - makemigrations - No changes detected
                            
                                How to remove frame from matplotlib (pyplot.figure vs matplotlib.figure ) (frameon=False Problematic in matplotlib)
                            
                                How to implement an ordered, default dict?
                            
                                Python memory leaks [closed]
                            
                                How can I improve my paw detection?
                            
                                Stop pip from failing on single package when installing with requirements.txt
                            
                                Extract a part of the filepath (a directory) in Python
                            
                                Concatenate a list of pandas dataframes together
                            
                                How to add multiple columns to pandas dataframe in one assignment?
                            
                                How to make a Python script run like a service or daemon in Linux
                            
                                Anaconda export Environment file
                            
                                Running a specific test case in Django when your app has a tests directory
                            
                                What do ellipsis [...] mean in a list?
                            
                                Removing all non-numeric characters from string in Python
                            
                                pytest cannot import module while python can
                            
                                Immutable vs Mutable types
                            
                                python: Change the scripts working directory to the script's own directory
                            
                                How do I override __getattr__ in Python without breaking the default behavior?
                            
                                How to add hovering annotations in matplotlib

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Efficient way to apply multiple filters to pandas DataFrame or Series

Tags:

python

algorithm

pandas

People also ask

Recent Activity

Donate For Us