
Pandas groupby and value_counts

I want to count the occurrences of each distinct value per column (with pd.value_counts, I guess), grouping the data by one level of a MultiIndex. The grouping itself is handled by groupby(level=...), but apply raises a ValueError.

Original dataframe:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.choice(list('ABC'), size=(10,5)),
                 columns=['c1','c2','c3','c4','c5'],
                 index=pd.MultiIndex.from_product([['foo', 'bar'],
                                                   ['w','y','x','y','z']]))



      c1 c2 c3 c4 c5
foo w  C  C  B  A  A
    y  A  A  C  B  A
    x  A  B  C  C  C
    y  A  B  C  C  C
    z  A  C  B  C  B
bar w  B  C  C  A  C
    y  A  A  C  A  A
    x  A  B  B  B  A
    y  A  A  C  A  B
    z  A  B  B  C  B

What I want:

       c1  c2  c3  c4  c5
foo A   4   1   0   1   2
    B   0   2   2   1   1
    C   1   2   3   3   2
bar A   4   2   0   3   2
    B   1   2   2   1   2
    C   0   1   3   1   1

What I tried:

>>> df.groupby(level=0).apply(pd.value_counts)

ValueError: could not broadcast input array from shape (5,5) into shape (5)

I can do it manually, but I think there must be a more obvious way.

# sort=False keeps the groups in order of first appearance, so they line up with `index` below
groups = [g.apply(pd.value_counts).fillna(0) for n, g in df.groupby(level=0, sort=False)]
index = df.index.get_level_values(0).unique()
correct_result = pd.concat(groups, keys=index)   # THIS WORKS AS EXPECTED

I mean, this isn't that long to write, but I feel like I'm reinventing the wheel. Isn't this kind of operation exactly what groupby is for?

Is there a more straightforward way of doing this, other than doing the split-apply-combine myself?

asked Aug 11 '18 12:08 by Susensio


1 Answer

Use stack to reshape the DataFrame into a Series with a three-level MultiIndex, then SeriesGroupBy.value_counts, and finally unstack to get back a DataFrame:

import numpy as np
import pandas as pd

np.random.seed(123)

df = pd.DataFrame(np.random.choice(list('ABC'), size=(10,5)),
                 columns=['c1','c2','c3','c4','c5'], 
                 index=pd.MultiIndex.from_product([['foo', 'bar'], 
                                                   ['w','y','x','y','z']]))
print(df)
      c1 c2 c3 c4 c5
foo w  C  B  C  C  A
    y  C  C  B  C  B
    x  C  B  A  B  C
    y  B  A  C  A  B
    z  C  B  A  A  A
bar w  A  B  C  A  C
    y  A  A  B  A  B
    x  A  A  A  C  B
    y  B  C  C  C  B
    z  A  A  C  B  A

df1 = df.stack().groupby(level=[0,2]).value_counts().unstack(1, fill_value=0)
print(df1)
       c1  c2  c3  c4  c5
bar A   4   3   1   2   1
    B   1   1   1   1   3
    C   0   1   3   2   1
foo A   0   1   2   2   2
    B   1   3   1   1   2
    C   4   1   2   2   1
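
To see what each link in that chain contributes, the one-liner can be split into named steps and cross-checked against a plain per-group loop. This is only a sketch: it rebuilds the same seeded frame, and the names stacked, counts and manual are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same construction as above, seeded so the run is repeatable.
np.random.seed(123)
df = pd.DataFrame(
    np.random.choice(list('ABC'), size=(10, 5)),
    columns=['c1', 'c2', 'c3', 'c4', 'c5'],
    index=pd.MultiIndex.from_product([['foo', 'bar'],
                                      ['w', 'y', 'x', 'y', 'z']]))

# Step 1: stack() moves the column labels into a third index level,
# giving one long Series indexed by (outer, inner, column).
stacked = df.stack()

# Step 2: group by the outer level (0) and the column level (2), then
# count each letter inside every (outer, column) pair.
counts = stacked.groupby(level=[0, 2]).value_counts()

# Step 3: move the column level (now at position 1) back to the columns;
# letters missing from a given column get 0 instead of NaN.
df1 = counts.unstack(1, fill_value=0)

# Cross-check with an explicit loop over the groups (hypothetical helper
# code, not part of the answer). Keying the dict by group name keeps the
# labels attached to the right block.
manual = pd.concat({
    name: group.apply(lambda col: col.value_counts()).fillna(0).astype(int)
    for name, group in df.groupby(level=0)
})

same = np.array_equal(df1.sort_index().to_numpy(),
                      manual.sort_index().to_numpy())
print(same)
```

Note that groupby sorts the group keys by default, which is why bar comes before foo in the result; passing sort=False to groupby should keep the order of first appearance if that matters.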
answered Nov 15 '22 01:11 by jezrael