What would be the most efficient way to use groupby and, at the same time, apply a filter in pandas?
Basically, I am asking for the pandas equivalent of the SQL
select * ... group by col_name having condition
I think there are many use cases, ranging from conditional means and sums to conditional probabilities, which would make such a command very powerful.
I need very good performance, so ideally such a command would not be the result of several layered operations done in Python.
Calling groupby() returns a GroupBy object rather than a plain dict. Its .groups attribute is a dict that maps each group key to the row labels in that group, so you can get the list of keys with Python's built-in keys() method.
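As a minimal sketch of the point above, the snippet below lists the group keys and pulls out one group. The toy DataFrame and the variable names are made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({"A": [1, 1, 5], "B": [2, 3, 6]})

g = df.groupby("A")

# .groups maps each group key to the index labels of its rows.
group_keys = list(g.groups.keys())

# get_group returns the sub-DataFrame for a single key.
first_group = g.get_group(1)

print(group_keys)
print(first_group)
```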
Pandas groupby is used for grouping the data according to categories and applying a function to each category. It also helps to aggregate data efficiently. The pandas DataFrame.groupby() function is used to split the data into groups based on some criteria.
There are, of course, alternatives to both pandas and SQL, but these two are the predominant tools in the field. Since both operate on tabular data, similar operations or queries can be done using either.
The HAVING clause is used instead of WHERE when the condition involves aggregate functions. The GROUP BY clause groups rows that have the same values into summary rows. WHERE filters individual rows before grouping, while HAVING filters the resulting groups, so the two can be combined in one query.
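A rough pandas counterpart of combining WHERE and HAVING might look like the sketch below: a boolean mask plays the role of WHERE (row-level, before grouping) and groupby().filter() plays the role of HAVING (group-level). The data and column names are made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({"A": [1, 1, 5, 5], "B": [2, 3, 6, -1]})

# WHERE B > 0: filter individual rows before grouping.
where_rows = df[df["B"] > 0]

# GROUP BY A HAVING COUNT(*) > 1: keep only groups with more than one row.
having_rows = where_rows.groupby("A").filter(lambda grp: len(grp) > 1)

print(having_rows)
```

Unlike SQL's HAVING, filter() returns the original rows of the surviving groups rather than one summary row per group.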
One other minor difference is that SQL uses the FROM clause to specify which dataset we are working with, i.e. the "train" table from the "titanic" schema; whereas in pandas, we put the name of the data frame at the beginning of the groupby command. It is also worth noting that SQL keeps missing values (NULL) as their own group when using GROUP BY, whereas pandas groupby drops them by default.
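To illustrate the difference in how missing values are handled, here is a small sketch. It assumes pandas 1.1 or later, where groupby accepts dropna=False. The "Fare" column and all values are made up and are not taken from the real Titanic data.

```python
import pandas as pd

# Made-up rows; only the "Embarked" column name comes from the text above.
df = pd.DataFrame({
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 80.0, 8.05],
})

# Default behaviour: rows whose key is missing are left out of the result.
default_counts = df.groupby("Embarked").size()

# dropna=False keeps the missing key as its own group, closer to SQL's GROUP BY.
with_missing = df.groupby("Embarked", dropna=False).size()

print(default_counts)
print(with_missing)
```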
As the pandas Development Team stated elegantly in their documentation for the GroupBy object, Group By involves three steps:
Step 1: Split the data into groups based on some criteria
Step 2: Apply a function to each group independently
Step 3: Combine the results into a data structure
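The three steps can be spelled out by hand, which may help to see what a one-line groupby aggregation does. This is only an illustrative sketch with made-up data, not how pandas implements it internally.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({"A": [1, 1, 5], "B": [2, 3, 6]})

# Step 1: split the rows into one sub-DataFrame per key in column A.
pieces = {key: grp for key, grp in df.groupby("A")}

# Step 2: apply a function (here, the sum of B) to each piece independently.
sums = {key: grp["B"].sum() for key, grp in pieces.items()}

# Step 3: combine the per-group results into a single Series.
manual = pd.Series(sums, name="B")

# The built-in aggregation does all three steps in one call.
builtin = df.groupby("A")["B"].sum()

print(manual)
print(builtin)
```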
The first occurrence of "Embarked" is equivalent to pandas' column indexing [Embarked].
Pandas DataFrames are very versatile in terms of their capability to manipulate, reshape and munge data. One of the prominent features of a DataFrame is its capability to aggregate data. Most often, the aggregation capability is compared to the GROUP BY facility in SQL.
As mentioned in unutbu's comment, groupby's filter is the equivalent of SQL's HAVING:
    In [11]: df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B'])

    In [12]: df
    Out[12]:
       A  B
    0  1  2
    1  1  3
    2  5  6

    In [13]: g = df.groupby('A')  # GROUP BY A

    In [14]: g.filter(lambda x: len(x) > 1)  # HAVING COUNT(*) > 1
    Out[14]:
       A  B
    0  1  2
    1  1  3
You can write more complicated functions (these are applied to each group), provided they return a plain ol' bool:
    In [15]: g.filter(lambda x: x['B'].sum() == 5)
    Out[15]:
       A  B
    0  1  2
    1  1  3
Note: there is potentially a bug where you can't write your function to act on the columns you've used to group by... a workaround is to group by the columns manually, i.e. g = df.groupby(df['A']).
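Since the question stresses performance, one alternative worth trying is to build a boolean mask with transform instead of calling a Python lambda once per group. This is a sketch of a different technique from filter, not part of the answer above, and whether it is actually faster depends on your data, so it is worth timing both.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=["A", "B"])

# HAVING SUM(B) = 5: broadcast each group's sum back onto its rows, then compare.
sum_mask = df.groupby("A")["B"].transform("sum") == 5
via_mask = df[sum_mask]

# HAVING COUNT(*) > 1: the same idea using the group size.
size_mask = df.groupby("A")["A"].transform("size") > 1
via_size = df[size_mask]

# The filter-based version from the answer, for comparison.
via_filter = df.groupby("A").filter(lambda grp: grp["B"].sum() == 5)

print(via_mask)
print(via_mask.equals(via_filter))
```

Passing the aggregation by name ("sum", "size") lets pandas use its built-in group operations, whereas a lambda passed to filter is evaluated in Python for every group.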