Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I get a similar summary of a Pandas dataframe as in R?

Different scales allow different types of operations. I would like to specify the scale of a column in a dataframe df. Then, df.describe() should take this into account.

Examples

  • Nominal scale: A nominal scale only allows to check for equivalence. Examples for this are sex, names, city names. You can basically only count how often they appear and give the most common ones (the mode).
  • Ordinal scale: You can order, but not say how far one is away from another. Cloth sizes are one example. You can calculate the median / min / max for this scale.
  • Quantitative scales: You can calculate the mean, standard deviation, quantiles for those scales.

Code example

import pandas as pd
import pandas.rpy.common as rcom
df = rcom.load_data('mtcars')
print(df.describe())

gives

             mpg        cyl        disp          hp       drat         wt  \
count  32.000000  32.000000   32.000000   32.000000  32.000000  32.000000   
mean   20.090625   6.187500  230.721875  146.687500   3.596563   3.217250   
std     6.026948   1.785922  123.938694   68.562868   0.534679   0.978457   
min    10.400000   4.000000   71.100000   52.000000   2.760000   1.513000   
25%    15.425000   4.000000  120.825000   96.500000   3.080000   2.581250   
50%    19.200000   6.000000  196.300000  123.000000   3.695000   3.325000   
75%    22.800000   8.000000  326.000000  180.000000   3.920000   3.610000   
max    33.900000   8.000000  472.000000  335.000000   4.930000   5.424000   

            qsec         vs         am       gear     carb  
count  32.000000  32.000000  32.000000  32.000000  32.0000  
mean   17.848750   0.437500   0.406250   3.687500   2.8125  
std     1.786943   0.504016   0.498991   0.737804   1.6152  
min    14.500000   0.000000   0.000000   3.000000   1.0000  
25%    16.892500   0.000000   0.000000   3.000000   2.0000  
50%    17.710000   0.000000   0.000000   4.000000   2.0000  
75%    18.900000   1.000000   1.000000   4.000000   4.0000  
max    22.900000   1.000000   1.000000   5.000000   8.0000  

This is not good as vs is a binary variable which indicates if the car has a v-engine or a straight engine (source). Hence the feature is of nominal scale. Hence min / max / std / mean are not applicable. It should rather be counted how often 0 and 1 appear.

In R, you can do the following:

mtcars$vs = factor(mtcars$vs, levels=c(0, 1), labels=c("straight engine", "V-Engine"))
mtcars$am = factor(mtcars$am, levels=c(0, 1), labels=c("Automatic", "Manual"))
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)
summary(mtcars)

and get

      mpg             cyl             disp             hp             drat      
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930  
       wt             qsec                     vs             am     gear   carb  
 Min.   :1.513   Min.   :14.50   straight engine:18   Automatic:19   3:15   1: 7  
 1st Qu.:2.581   1st Qu.:16.89   V-Engine       :14   Manual   :13   4:12   2:10  
 Median :3.325   Median :17.71                                       5: 5   3: 3  
 Mean   :3.217   Mean   :17.85                                              4:10  
 3rd Qu.:3.610   3rd Qu.:18.90                                              6: 1  
 Max.   :5.424   Max.   :22.90                                              8: 1  

Is something similar also possible with Pandas?

I tried

df["vs"] = df["vs"].astype('category')

But this makes "vs" disappear from the description.

like image 471
Martin Thoma Avatar asked Sep 07 '16 10:09

Martin Thoma


People also ask

How do you get a statistical summary of a DataFrame DF?

The describe() function computes a summary of statistics pertaining to the DataFrame columns. This function gives the mean, std and IQR values. And, function excludes the character columns and given summary about numeric columns.

How do I display a summary of a DataFrame in R?

In the following program, we take a Data Frame in df with two numeric columns, and find its summary using summary() function. Output contains summary which is column wise. In the first column named “a”, minimum value is 41.0, maximum value is 43.0, mean 42.0, and so on.

What is the pandas equivalent in R?

Pandas for Python and Dplyr for R are the two most popular libraries for working with tabular/structured data for many Data Scientists.

Which method is used to show the summary statistics of a DataFrame?

A quick introduction to Pandas Describe The describe() method computes and displays summary statistics for a Python dataframe. (It also operates on dataframe columns and Pandas series objects.)


1 Answers

Late to the party, but I happen to have been struggling with some of the same issues lately, so I thought I would share my take on this challenge.


It seems to me that R is still better at handling categorical variables. There are however a few ways you could mimic some of this functionality using Python with pd.Categorical(), pd.GetDummies() and describe().

The challenge in this particular dataset is that the categorical variables have very different properties. For example am is 0 or 1 for automatic or manual gears, respectively. And gear is either 3, 4, or 5, but still most reasonably considered as a categorical rather than numerical value. So for am I would replace 0 and 1 with 'automatic' and 'categorical', but for gear I would apply pd.GetDummies() to get 0 or 1 for each category of gear in order to be able to easily count how many models that have, for example, 3 gears.

I've had a utility function lying around for some time that I improved a bit yesterday. It surely is not the most elegangt, but it should give you the same information as you got using the R snippet. Your final output table consists of columns with an unequal number of rows. Instead of making a similar table as a dataframe and filling it with NaNs, I split the information in two: One table for numerical values, and one for categoricial values so you end up with this:

                 count
Straight Engine     18
V engine            14
automatic           13
manual              19
cyl_4               11
cyl_6                7
cyl_8               14
gear_3              15
gear_4              12
gear_5               5
carb_1               7
carb_2              10
carb_3               3
carb_4              10
carb_6               1
carb_8               1
             mpg        disp          hp       drat         wt       qsec
count  32.000000   32.000000   32.000000  32.000000  32.000000  32.000000
mean   20.090625  230.721875  146.687500   3.596563   3.217250  17.848750
std     6.026948  123.938694   68.562868   0.534679   0.978457   1.786943
min    10.400000   71.100000   52.000000   2.760000   1.513000  14.500000
25%    15.425000  120.825000   96.500000   3.080000   2.581250  16.892500
50%    19.200000  196.300000  123.000000   3.695000   3.325000  17.710000
75%    22.800000  326.000000  180.000000   3.920000   3.610000  18.900000
max    33.900000  472.000000  335.000000   4.930000   5.424000  22.900000

Here's the whole procedure for an easy copy&paste:

# imports
import pandas as pd

# to easily access R datasets:
# pip install pydataset
from pydataset import data 

# Load dataset
df_mtcars = data('mtcars')


# The following variables: cat, dum, num and recoding
# are used in the function describeCat/df, dummies, recode, categorical) below

# Specify which variables are dummy variables [0 or 1], 
# ategorical [multiple categories] or numeric
cat = ['cyl', 'gear', 'carb']
dum = ['vs', 'am']
num = [c for c in list(df_mtcars) if c not in cat+dum]

# Also, define a dictionary that describes how some dummy variables should be recoded
# For example, in the series am, 0 is recoded as automatic and 1 as manual gears
recoding = {'am':['manual', 'automatic'], 'vs':['Straight Engine', 'V engine']}

# The function:
def describeCat(df, dummies, recode, categorical):
    """ Retrieves specified dummy and categorical variables
        from a pandas DataFrame and describes them (just count for now).

        Dummy variables [0 or 1] can be recoded to categorical variables
        by specifying a dictionary

    Keyword arguments:
    df -- pandas DataFrame
    dummies -- list of column names to specify dummy variables [0 or 1]
    recode -- dictionary to specify which and how dummyvariables should be recoded
    categorical -- list of columns names to specify catgorical variables

    """


    # Recode dummy variables
    recoded = []

    # DataFrame to store recoded variables
    df_recoded = pd.DataFrame()

    for dummy in dummies:
        if dummy in recode.keys():

            dummySeries = df[dummy].copy(deep = True).to_frame()
            dummySeries[dummy][dummySeries[dummy] == 0] = recode[dummy][0]
            dummySeries[dummy][dummySeries[dummy] == 1] = recode[dummy][1]
            recoded.append(pd.Categorical(dummySeries[dummy]).describe())  

            df_rec = pd.DataFrame(pd.Categorical(dummySeries[dummy]).describe())
            df_recoded = pd.concat([df_recoded.reset_index(),df_rec.reset_index()],
                                    ignore_index=True).set_index('categories')

    df_recoded = df_recoded['counts'].to_frame()

    # Rename columns and change datatype
    df_recoded['counts'] = df_recoded['counts'].astype(int)
    df_recoded.columns = ['count']


    # Since categorical variables will be transformed into dummy variables,
    # all remaining dummy variables (after recoding) can be treated the
    # same way as the categorical variables
    unrecoded = [var for var in dum if var not in recoding.keys()]
    categorical = categorical + unrecoded

    # Categorical split into dummy variables will have the same index
    # as the original dataframe
    allCats = pd.DataFrame(index = df.index)

    # apply pd.get_dummies on all categoirical variables
    for cat in categorical:
        newCats = pd.DataFrame(data = pd.get_dummies(pd.Categorical(df_mtcars[cat]), prefix = cat))
        newCats.index = df_mtcars.index
        allCats = pd.concat([allCats, newCats], axis = 1)
        df_cat = allCats.sum().to_frame()
    df_cat.columns = ['count']

    # gather output dataframes
    df_output = pd.concat([df_recoded, df_cat], axis = 0)


    return(df_output)

# Test run: Build a dataframe that describes the dummy and categorical variables
df_categorical = describeCat(df = df_mtcars, dummies = dum, recode = recoding, categorical = cat)

# describe numerical variables
df_numerical = df_mtcars[num].describe()

print(df_categorical)
print(df_numerical)

A sidenote about categorical variables and describe():

The reason why I'm using pd.Categorical() in the function above is that the output from describe() seems to be somewhat unstable. Sometimes df_mtcars['gear'].astype('category').describe() returns:

count    32.000000
mean      3.687500
std       0.737804
min       3.000000
25%       3.000000
50%       4.000000
75%       4.000000
max       5.000000
Name: gear, dtype: float64

While it should, given that it is considered a categorical variable, return:

count     32
unique     3
top        3
freq      15
Name: gear, dtype: int64

I may be wrong here, and I am having problems reproducing that issue, but I can swear this happens from time to time.

Using describe() on a pd.Categorical() gives an output of it's own format, but at least it seems to be stable.

            counts    freqs
categories                 
3               15  0.46875
4               12  0.37500
5                5  0.15625

And some last words about pd.get_dummies()

Here's what happens when you apply that function to df_mtcars['gear']:

# code
pd.get_dummies(df_mtcars['gear'].astype('category'), prefix = 'gear')

# output
                     gear_3  gear_4  gear_5
Mazda RX4                 0       1       0
Mazda RX4 Wag             0       1       0
Datsun 710                0       1       0
Hornet 4 Drive            1       0       0
Hornet Sportabout         1       0       0
Valiant                   1       0       0
.
.
.
Ferrari Dino              0       0       1
Maserati Bora             0       0       1
Volvo 142E                0       1       0

But in this case, I would simply use value_counts() so that you would get the following:

            counts    freqs
categories                 
3               15  0.46875
4               12  0.37500
5                5  0.15625

Which also happens to resemble the output from using describe() on a pd.Categorical() variable.

like image 86
vestland Avatar answered Oct 11 '22 14:10

vestland