You can sort pandas DataFrame by one or multiple (one or more) columns using sort_values() method and by ascending or descending order. To specify the order, you have to use ascending boolean property; False for descending and True for ascending.
Using Loc to Filter With Multiple Conditions The loc function in pandas can be used to access groups of rows or columns by label. Add each condition you want to be included in the filtered result and concatenate them with the & operator. You'll see our code sample will return a pd. dataframe of our filtered rows.
Pandas comes with a whole host of sql-like aggregation functions you can apply when grouping on one or more columns. This is Python's closest equivalent to dplyr's group_by + summarise logic.
Using &
operator, don't forget to wrap the sub-statements with ()
:
males = df[(df[Gender]=='Male') & (df[Year]==2014)]
To store your dataframes in a dict
using a for loop:
from collections import defaultdict
dic={}
for g in ['male', 'female']:
dic[g]=defaultdict(dict)
for y in [2013, 2014]:
dic[g][y]=df[(df[Gender]==g) & (df[Year]==y)] #store the DataFrames to a dict of dict
A demo for your getDF
:
def getDF(dic, gender, year):
return dic[gender][year]
print genDF(dic, 'male', 2014)
For more general boolean functions that you would like to use as a filter and that depend on more than one column, you can use:
df = df[df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)]
where f is a function that is applied to every pair of elements (x1, x2) from col_1 and col_2 and returns True or False depending on any condition you want on (x1, x2).
Start from pandas 0.13, this is the most efficient way.
df.query('Gender=="Male" & Year=="2014" ')
In case somebody wonders what is the faster way to filter (the accepted answer or the one from @redreamality):
import pandas as pd
import numpy as np
length = 100_000
df = pd.DataFrame()
df['Year'] = np.random.randint(1950, 2019, size=length)
df['Gender'] = np.random.choice(['Male', 'Female'], length)
%timeit df.query('Gender=="Male" & Year=="2014" ')
%timeit df[(df['Gender']=='Male') & (df['Year']==2014)]
Results for 100,000 rows:
6.67 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.54 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Results for 10,000,000 rows:
326 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
472 ms ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So results depend on the size and the data. On my laptop, query()
gets faster after 500k rows. Further, the string search in Year=="2014"
has an unnecessary overhead (Year==2014
is faster).
You can create your own filter function using query
in pandas
. Here you have filtering of df
results by all the kwargs
parameters. Dont' forgot to add some validators(kwargs
filtering) to get filter function for your own df
.
def filter(df, **kwargs):
query_list = []
for key in kwargs.keys():
query_list.append(f'{key}=="{kwargs[key]}"')
query = ' & '.join(query_list)
return df.query(query)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With