Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to count subgroups of categorical data in a pandas Dataframe?

I have the following pandas dataframe:

import pandas as pd
import numpy as np
df = pd.DataFrame({"shops": ["shop1", "shop2", "shop3", "shop4", "shop5", "shop6"], "franchise" : ["franchise_A", "franchise_A", "franchise_A", "franchise_A", "franchise_B", "franchise_B"],"items" : ["dog", "cat", "dog", "dog", "bird", "fish"]})
df = df[["shops", "franchise", "items"]]
print(df)

   shops    franchise items
0  shop1  franchise_A   dog
1  shop2  franchise_A   cat
2  shop3  franchise_A   dog
3  shop4  franchise_A   dog
4  shop5  franchise_B  bird
5  shop6  franchise_B  fish

So, each row is a unique sample shop1, shop2, etc. whereby each sample belongs to a subgroup franchise_A, franchise_B, franchise_C, etc. In the items column, there are only four categorical values possible: dog, cat, fish, bird. My motivation is to create a barplot of the number of dog, cat, fish, bird for each "franchise".

I would like the output to be

franchise        dogs    cats    birds    fish
franchise_A      3       1       0        0
franchise_B      0       0       1        1

I believe I first have to use groupby(), e.g.

df.groupby("franchise").count()
             shops  items
franchise                
franchise_A      4      4
franchise_B      2      2

But I'm not sure how I count the number of items for each franchise.

like image 661
ShanZhengYang Avatar asked Mar 02 '17 18:03

ShanZhengYang


People also ask

How do you count categorical data in pandas?

Pandas value_counts() can get counts of unique values of columns in a Pandas dataframe. Starting from Pandas version 1.1. 0, we can use value_counts() on a Pandas Series and dataframe as well.

How do you count categories in a Dataframe in Python?

To count the number of occurrences in e.g. a column in a dataframe you can use Pandas value_counts() method. For example, if you type df['condition']. value_counts() you will get the frequency of each unique value in the column “condition”.

How do you count the number of records for a categorical variable?

When we have two categorical variables then each of them is likely to have different number of rows for the other variable. This helps us to understand the combinatorial values of those two categorical variables. We can find such type of rows using count function of dplyr package.

How do you count the number of occurrences in pandas Dataframe?

Using the size() or count() method with pandas. DataFrame. groupby() will generate the count of a number of occurrences of data present in a particular column of the dataframe.


2 Answers

You can use value_counts with unstack, thanks Nickil Maveli:

from collections import Counter

print (df.groupby("franchise")['items'].value_counts().unstack(fill_value=0))
items        bird  cat  dog  fish
franchise                        
franchise_A     0    1    3     0
franchise_B     1    0    0     1

Another solutions with crosstab and pivot_table:

print (pd.crosstab(df["franchise"], df['items']))
items        bird  cat  dog  fish
franchise                        
franchise_A     0    1    3     0
franchise_B     1    0    0     1

print (df.pivot_table(index="franchise", columns='items', aggfunc='size', fill_value=0))
items        bird  cat  dog  fish
franchise                        
franchise_A     0    1    3     0
franchise_B     1    0    0     1
like image 100
jezrael Avatar answered Sep 29 '22 08:09

jezrael


You could include the items column in the groupby, then use size.

>>> df.groupby(['franchise', 'items']).size().unstack(fill_value=0)

items        bird  cat  dog  fish
franchise                        
franchise_A     0    1    3     0
franchise_B     1    0    0     1

(Rough) Benchmark

%timeit df.groupby(['franchise', 'items']).size().unstack(fill_value=0)
100 loops, best of 3: 2.73 ms per loop

%timeit (df.groupby("franchise")['items'].apply(Counter).unstack(fill_value=0).astype(int))
100 loops, best of 3: 4.18 ms per loop

%timeit df.groupby('franchise')['items'].value_counts().unstack(fill_value=0)
100 loops, best of 3: 2.71 ms per loop
like image 37
miradulo Avatar answered Sep 29 '22 08:09

miradulo