
What is the equivalent of SQL "GROUP BY HAVING" on Pandas?

What would be the most efficient way to use groupby and, at the same time, apply a filter in pandas?

Basically, I am asking for the pandas equivalent of this SQL:

select * ... group by col_name having condition 

I think there are many use cases, ranging from conditional means and sums to conditional probabilities, which would make such a command very powerful.

I need very good performance, so ideally such a command would not be the result of several layered operations done in Python.

Mannaggia asked Feb 28 '14 20:02



1 Answer

As mentioned in unutbu's comment, groupby's filter is the equivalent of SQL's HAVING:

    In [11]: df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B'])

    In [12]: df
    Out[12]:
       A  B
    0  1  2
    1  1  3
    2  5  6

    In [13]: g = df.groupby('A')  #  GROUP BY A

    In [14]: g.filter(lambda x: len(x) > 1)  #  HAVING COUNT(*) > 1
    Out[14]:
       A  B
    0  1  2
    1  1  3

You can write more complicated functions (these are applied to each group), provided they return a plain ol' bool:

    In [15]: g.filter(lambda x: x['B'].sum() == 5)
    Out[15]:
       A  B
    0  1  2
    1  1  3
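Since the question asks about performance, here is a sketch of a different technique from filter: broadcast each group's aggregate back onto its rows with transform, then select rows with a boolean mask. This is an alternative I am swapping in, not what filter does internally. It avoids calling a Python lambda once per group, which may help on frames with many groups, but that is an assumption worth timing on your own data. The variable names (group_size, group_sum, having_count, having_sum) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B'])

# Broadcast each group's aggregate back onto every row of that group.
group_size = df.groupby('A')['B'].transform('size')  # COUNT(*) per group
group_sum = df.groupby('A')['B'].transform('sum')    # SUM(B) per group

# Keep only the rows whose group satisfies the HAVING-style condition.
having_count = df[group_size > 1]  # HAVING COUNT(*) > 1
having_sum = df[group_sum == 5]    # HAVING SUM(B) = 5
```

On this small frame, both masks select the same rows as the two filter calls above (the rows where A is 1).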

Note: there is potentially a bug where you can't write your function to act on the columns you've used to group by. A workaround is to group by the columns manually, i.e. g = df.groupby(df['A']).
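filter returns the original rows of the groups that pass. If what you want is closer to SQL's usual output, one aggregated row per group, a minimal sketch is to aggregate first and then apply the HAVING condition to the aggregated frame. This assumes a pandas version with named aggregation (0.25 or later); the column names n and total are hypothetical.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [5, 6]], columns=['A', 'B'])

# SELECT A, COUNT(*) AS n, SUM(B) AS total FROM df GROUP BY A
summary = df.groupby('A').agg(n=('B', 'size'), total=('B', 'sum'))

# HAVING COUNT(*) > 1: an ordinary boolean mask on the aggregated frame
result = summary[summary['n'] > 1]
```

Here result has a single row, for the group A = 1, with n equal to 2 and total equal to 5.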

Andy Hayden answered Oct 17 '22 07:10