Python pandas: How to group by and count unique values based on multiple columns?

I have a DataFrame df:

id name number
1 sam   76
2 sam    8
2 peter  8 
4 jack   2

I would like to group by the 'id' column and count the number of unique (name, number) pairs in each group:

id count(name-number)
1    1
2    2
4    1     

I have tried this, but it does not work:

df.groupby('id')[('number','name')].nunique().reset_index()
UserYmY asked Dec 08 '22 23:12



2 Answers

You can just combine two groupbys to get the desired result.

import pandas
df = pandas.DataFrame({"id": [1, 2, 2, 4], "name": ["sam", "sam", "peter", "jack"], "number": [8, 8, 8, 2]})
group = df.groupby(['id','name','number']).size().groupby(level=0).size()

The first groupby reduces the data to one entry per distinct (id, name, number) combination, which makes the pairs you want to count unique. The second groupby then counts those entries per id; it can group on level=0 because the first groupby put id into the first level of the index.
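To see what each step contributes, the sketch below runs the two groupbys separately on the data from the question (so number is 76 in the first row, unlike the sample frame in this answer). The variable names combos and pairs_per_id are made up for illustration.

```python
import pandas as pd

# Data as given in the question.
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "name": ["sam", "sam", "peter", "jack"],
    "number": [76, 8, 8, 2],
})

# Step 1: one entry per distinct (id, name, number) combination.
# The values are row counts, but only the index matters here.
combos = df.groupby(["id", "name", "number"]).size()

# Step 2: group on the first index level (id) and count
# how many distinct combinations fall under each id.
pairs_per_id = combos.groupby(level=0).size()

print(combos)
print(pairs_per_id)
```

Printing combos first makes it easy to check that duplicate rows have been collapsed before the second count is taken.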

The result will be a Series. If you want a DataFrame with the column name from your desired result, name the counts when resetting the index (recent pandas versions reject a renaming dict passed to agg on a grouped Series):

group = df.groupby(['id','name','number']).size().groupby(level=0).size().reset_index(name='count(name-number)')
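A different route to the same per-id counts, sketched here as an alternative rather than as part of the answer above: drop duplicate (id, name, number) rows first, then count the remaining rows per id. The data is the frame from the question, and the name result is only illustrative.

```python
import pandas as pd

# Data as given in the question.
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "name": ["sam", "sam", "peter", "jack"],
    "number": [76, 8, 8, 2],
})

result = (
    df.drop_duplicates(["id", "name", "number"])  # keep each combination once
      .groupby("id")
      .size()                                     # rows left per id
      .reset_index(name="count(name-number)")     # Series -> DataFrame
)

print(result)
```

Because the duplicates are removed before grouping, a single groupby is enough.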
stedes answered Feb 09 '23 01:02



You can do:

import pandas
df = pandas.DataFrame({"id": [1, 2, 3, 4], "name": ["sam", "sam", "peter", "jack"], "number": [8, 8, 8, 2]})
g = df.groupby(["name", "number"])
print(g.groups)

which gives:

{('jack', 2): [3], ('peter', 8): [2], ('sam', 8): [0, 1]}

To get the number of entries per pair you can do:

for p in g.groups:
    print(p, " has ", len(g.groups[p]), " entries")

which gives:

('peter', 8)  has  1  entries
('jack', 2)  has  1  entries
('sam', 8)  has  2  entries

update:

The OP asked for the result as a DataFrame. One way to get this is to aggregate with the len function, which returns a DataFrame with the number of entries per pair:

d = g.aggregate(len)
print(d.reset_index().rename(columns={"id": "num_entries"}))

gives:

    name  number  num_entries
0   jack       2           1
1  peter       8           1
2    sam       8           2
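The same table can probably be built more directly with size(), which counts rows per group without needing a leftover column to rename. This is a minimal sketch using this answer's sample frame; the name counts is only illustrative.

```python
import pandas as pd

# Sample frame from this answer.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["sam", "sam", "peter", "jack"],
    "number": [8, 8, 8, 2],
})

# size() yields one count per (name, number) pair;
# reset_index(name=...) turns it into a DataFrame and names the count column.
counts = df.groupby(["name", "number"]).size().reset_index(name="num_entries")

print(counts)
```

Unlike aggregate(len), this does not depend on which other columns happen to be in the frame.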
mvd answered Feb 09 '23 00:02
