Vectorization is always the first and best choice. If you must iterate, converting the DataFrame to a NumPy array or to dictionary format speeds up the workflow considerably; iterating over the key-value pairs of a dictionary comes out fastest, at roughly a 280x speedup for 20 million records.
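A minimal sketch of that idea (the toy frame and the to_dict orientation here are illustrative assumptions, not the original benchmark):
import pandas as pd

df = pd.DataFrame({'col1': range(5), 'col2': range(10, 15)})

# slow: row-by-row iteration over the DataFrame itself
total = sum(row['col1'] for _, row in df.iterrows())

# faster: convert once, then iterate over plain dictionaries
records = df.to_dict('records')  # list of {column: value} dicts
total = sum(rec['col1'] for rec in records)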
You can create a DataFrame from multiple Series objects by adding each Series as a column. Alternatively, the concat() method merges multiple Series into a single DataFrame.
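For example, a minimal sketch (the Series here are made up for illustration):
import pandas as pd

s1 = pd.Series([1, 2, 3], name='col1')
s2 = pd.Series([10, 20, 30], name='col2')

# axis=1 lines the Series up side by side as columns
df = pd.concat([s1, s2], axis=1)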
Pandas (and numpy) allow for boolean indexing, which will be much more efficient:
In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]:
1 1
2 2
Name: col1
In [12]: df[df['col1'] >= 1]
Out[12]:
col1 col2
1 1 11
2 2 12
In [13]: df[(df['col1'] >= 1) & (df['col1'] <= 1)]
Out[13]:
col1 col2
1 1 11
If you want to write helper functions for this, consider something along these lines:
In [14]: import numpy as np

In [15]: from operator import ge, le

In [16]: def b(x, col, op, n):
   ....:     return op(x[col], n)

In [17]: def f(x, *b):
   ....:     return x[np.logical_and(*b)]

In [18]: b1 = b(df, 'col1', ge, 1)

In [19]: b2 = b(df, 'col1', le, 1)

In [20]: f(df, b1, b2)
Out[20]:
col1 col2
1 1 11
Update: pandas 0.13 has a query method for these kinds of use cases. Assuming column names are valid identifiers, the following works (and can be more efficient for large frames, since it uses numexpr behind the scenes):
In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
col1 col2
1 1 11
Chaining conditions creates long lines, which are discouraged by PEP 8. Using the .query method forces you to use strings, which is powerful but unpythonic and not very dynamic.
Once each of the filters is in place, one approach is:
import numpy as np
import functools

def conjunction(*conditions):
    return functools.reduce(np.logical_and, conditions)

c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4

data_filtered = data[conjunction(c_1, c_2, c_3)]
np.logical_and operates on arrays and is fast, but it does not take more than two arguments; functools.reduce handles that.
Note that this still has some redundancies: a) short-circuiting does not happen on a global level, and b) each of the individual conditions runs on the whole initial data. Still, I expect this to be efficient enough for many applications, and it is very readable.
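If those redundancies matter, a hedged alternative (my own sketch, not part of the approach above) is to apply the conditions sequentially, so each later condition only runs on the rows that survived the earlier ones:
# each predicate takes the current (already filtered) frame,
# so later conditions evaluate on progressively fewer rows
predicates = [
    lambda d: d.col1 == True,
    lambda d: d.col2 < 64,
    lambda d: d.col3 != 4,
]
data_filtered = data
for pred in predicates:
    data_filtered = data_filtered[pred(data_filtered)]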
You can also make a disjunction (wherein only one of the conditions needs to be true) by using np.logical_or instead:
import numpy as np
import functools

def disjunction(*conditions):
    return functools.reduce(np.logical_or, conditions)

c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4

data_filtered = data[disjunction(c_1, c_2, c_3)]
Simplest of All Solutions:
Use:
filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]
Another example: to filter the DataFrame for values belonging to Feb-2018, use the code below:
filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]
Since the pandas 0.22 update, comparison methods are available, such as gt (greater than), lt (less than), ge (greater than or equal to), le (less than or equal to), and eq (equal to), among others. These methods return a boolean Series. Let's see how we can use them:
# sample data
df = pd.DataFrame({'col1': [0, 1, 2,3,4,5], 'col2': [10, 11, 12,13,14,15]})
# get values from col1 greater than or equals to 1
df.loc[df['col1'].ge(1),'col1']
1 1
2 2
3 3
4 4
5 5
# where col1 values are between 0 and 2
df.loc[df['col1'].between(0,2)]
col1 col2
0 0 10
1 1 11
2 2 12
# where col1 > 1
df.loc[df['col1'].gt(1)]
col1 col2
2 2 12
3 3 13
4 4 14
5 5 15
Why not do this?
def filt_spec(df, col, val, op):
    import operator
    ops = {'eq': operator.eq, 'neq': operator.ne,
           'gt': operator.gt, 'ge': operator.ge,
           'lt': operator.lt, 'le': operator.le}
    return df[ops[op](df[col], val)]

pd.DataFrame.filt_spec = filt_spec  # attach as a method (assumes import pandas as pd)
Demo:
df = pd.DataFrame({'a': [1,2,3,4,5], 'b':[5,4,3,2,1]})
df.filt_spec('a', 2, 'ge')
Result:
a b
1 2 4
2 3 3
3 4 2
4 5 1
You can see that column 'a' has been filtered where a >= 2.
This is slightly faster (typing time, not performance) than operator chaining. You could of course put the import at the top of the file.
We can also select rows based on values of a column that are not in a list or any iterable. We build the boolean mask just like before, but negate it by placing ~ in front.
For example:
values = [1, 0]
df[~df.col1.isin(values)]
If you want to check any/all of multiple columns for a value, you can do:
df[(df[['HomeTeam', 'AwayTeam']] == 'Fulham').any(axis=1)]
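For the "all" case, swap .any for .all; a small sketch with a made-up frame:
import pandas as pd

df = pd.DataFrame({'HomeTeam': ['Fulham', 'Arsenal', 'Fulham'],
                   'AwayTeam': ['Chelsea', 'Fulham', 'Fulham']})

# keep only rows where *both* columns equal 'Fulham'
df[(df[['HomeTeam', 'AwayTeam']] == 'Fulham').all(axis=1)]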