SQL "select ... group by a having count(1) > 1" equivalent in Python pandas?

I'm having a hard time filtering groupby results in pandas. I want to do the equivalent of:

select email, count(1) as cnt 
from customers 
group by email 
having count(email) > 1 
order by cnt desc

I tried:

customers.groupby('Email')['CustomerID'].size()

This gives me the list of emails and their respective counts correctly, but I can't reproduce the having count(email) > 1 part.

email_cnt[email_cnt.size > 1]

returns 1

email_cnt = customers.groupby('Email')
email_dup = email_cnt.filter(lambda x: len(x) > 1)

gives the full customer records for the duplicated emails, but I want the aggregated table (one row per email with its count).

tangkk asked Dec 31 '14 08:12



2 Answers

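Instead of writing email_cnt[email_cnt.size > 1], just write email_cnt[email_cnt > 1]. There's no need to call .size again: email_cnt already holds the counts, and on a Series .size is just the total number of elements, so email_cnt.size > 1 is a single True/False rather than a per-email test. The expression email_cnt > 1 produces a Boolean Series, which is then used to select only the relevant values of email_cnt.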

For example:

>>> import pandas as pd
>>> customers = pd.DataFrame({'Email':['foo','bar','foo','foo','baz','bar'],
...                           'CustomerID':[1,2,1,2,1,1]})
>>> email_cnt = customers.groupby('Email')['CustomerID'].size()
>>> email_cnt[email_cnt > 1]
Email
bar      2
foo      3
dtype: int64
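
To make the difference concrete, here is a minimal sketch on the same sample data, contrasting the scalar test from the question with the Boolean mask. The variable names n_groups, scalar_test, mask and dups are made up for illustration, and the sort_values(ascending=False) step is an addition assumed to mirror the order by cnt desc clause of the SQL.

```python
import pandas as pd

customers = pd.DataFrame({'Email': ['foo', 'bar', 'foo', 'foo', 'baz', 'bar'],
                          'CustomerID': [1, 2, 1, 2, 1, 1]})
email_cnt = customers.groupby('Email')['CustomerID'].size()

# On a Series, .size is a plain integer: the number of elements (groups).
n_groups = email_cnt.size
scalar_test = n_groups > 1      # a single bool, not a per-email mask

# Comparing the Series itself yields one boolean per email.
mask = email_cnt > 1

# Keep only counts above 1, largest first (mirrors "order by cnt desc").
dups = email_cnt[mask].sort_values(ascending=False)
print(dups)
```

Indexing with mask keeps the Email index, so the result is still a Series of counts keyed by email.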
Alex Riley answered Sep 18 '22 18:09


Two other solutions, using the modern "method chaining" approach:

Using selection by callable:

customers.groupby('Email').size().loc[lambda x: x>1].sort_values()

Using the query method:

(customers.groupby('Email')['CustomerID']
    .agg([len]).query('len > 1').sort_values('len'))
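
If the goal is an aggregate table shaped like the SQL result, the same chaining idea can be sketched as below. The column name cnt, the reset_index(name=...) step and the descending sort are choices made here to mirror the query in the question; they are not part of the answers above.

```python
import pandas as pd

customers = pd.DataFrame({'Email': ['foo', 'bar', 'foo', 'foo', 'baz', 'bar'],
                          'CustomerID': [1, 2, 1, 2, 1, 1]})

result = (
    customers.groupby('Email')
    .size()                                # rows per email, like count(1)
    .reset_index(name='cnt')               # Series -> DataFrame with columns Email, cnt
    .query('cnt > 1')                      # having count(email) > 1
    .sort_values('cnt', ascending=False)   # order by cnt desc
    .reset_index(drop=True)                # renumber rows 0..n-1
)
print(result)
```

Naming the count column up front means query and sort_values can refer to cnt directly instead of a column called len.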
Ilya V. Schurov answered Sep 18 '22 18:09