Pandas dataframe: how to apply describe() to each group and add to new columns?

df:

    name  score
    A     1
    A     2
    A     3
    A     4
    A     5
    B     2
    B     4
    B     6
    B     8

I want to get a new dataframe in the following form:

    name  count  mean  std  min  25%  50%  75%  max
    A     5      3     ..   ..   ..   ..   ..   ..
    B     4      5     ..   ..   ..   ..   ..   ..

How can I extract the information from df.describe() and reformat it? Thanks

Robin1988 asked Nov 06 '15 20:11




2 Answers

There is an even shorter one :)

print(df.groupby('name').describe().unstack(1))

Nothing beats a one-liner:

In [145]:

print(df.groupby('name').describe().reset_index().pivot(index='name', values='score', columns='level_1'))
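Both one-liners above rely on groupby().describe() returning the statistics stacked in the row index, which is how pandas behaved when this was written. As a hedged sketch, assuming pandas 0.20 or later (where the grouped describe() already comes back with one row per group), selecting the score column first should give the requested layout directly. The variable name summary is only illustrative:

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    "name": list("AAAAABBBB"),
    "score": [1, 2, 3, 4, 5, 2, 4, 6, 8],
})

# Selecting 'score' first yields flat statistic columns (count, mean, ...)
# instead of a two-level column index.
summary = df.groupby("name")["score"].describe()

# Turn the group label back into an ordinary 'name' column
summary = summary.reset_index()
print(summary)
```

If several numeric columns need summarizing at once, dropping the ["score"] selection keeps them all, at the cost of a two-level column index.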

Andrey Vykhodtsev answered Sep 20 '22 08:09



Define some data

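In[1]:

    import pandas as pd
    import io

    # Whitespace-separated sample data, same as in the question
    data = """
    name score
    A      1
    A      2
    A      3
    A      4
    A      5
    B      2
    B      4
    B      6
    B      8
    """

    # Raw string so that \s is passed to the regex engine unchanged
    df = pd.read_csv(io.StringIO(data), delimiter=r'\s+')
    print(df)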


Out[1]:

      name  score
    0    A      1
    1    A      2
    2    A      3
    3    A      4
    4    A      5
    5    B      2
    6    B      4
    7    B      6
    8    B      8

Solution

A nice approach to this problem uses a generator expression (see footnote) to let pd.DataFrame() iterate over the results of groupby and construct the summary stats DataFrame on the fly:

In[2]:

    df2 = pd.DataFrame(group.describe().rename(columns={'score': name}).squeeze()
                       for name, group in df.groupby('name'))

    print(df2)


Out[2]:

       count  mean       std  min  25%  50%  75%  max
    A      5     3  1.581139    1  2.0    3  4.0    5
    B      4     5  2.581989    2  3.5    5  6.5    8

Here squeeze() removes the length-one column dimension, converting each group's one-column summary stats DataFrame into a Series. The rename beforehand gives that Series the group's name, which becomes the row label in df2.
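To see the squeeze step in isolation, here is a minimal sketch for a single group. The hand-built group frame stands in for what groupby would hand over for group B, and the names stats and as_series are illustrative only:

```python
import pandas as pd

# Stand-in for the sub-frame that groupby yields for group 'B'
group = pd.DataFrame({"name": ["B"] * 4, "score": [2, 4, 6, 8]})

# describe() keeps only numeric columns by default, so the result
# is a DataFrame with a single 'score' column.
stats = group.describe()

# squeeze() collapses that single column into a Series
as_series = stats.squeeze()

print(type(stats).__name__, stats.shape)
print(type(as_series).__name__, as_series.shape)
```

A list (or generator) of such Series, each renamed to its group label, is what pd.DataFrame() stacks into rows in the solution above.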

Footnote: A generator expression has the form my_function(a) for a in iterator or, when the iterator yields two-element tuples as groupby does, my_function(a, b) for a, b in iterator.

Pedro M Duarte answered Sep 20 '22 08:09
