How can I get a similar summary of a Pandas dataframe as in R?

Examples

Nominal scale: A nominal scale only allows to check for equivalence. Examples for this are sex, names, city names. You can basically only count how often they appear and give the most common ones (the mode).
Ordinal scale: You can order, but not say how far one is away from another. Cloth sizes are one example. You can calculate the median / min / max for this scale.
Quantitative scales: You can calculate the mean, standard deviation, quantiles for those scales.

Code example

import pandas as pd
import pandas.rpy.common as rcom
df = rcom.load_data('mtcars')
print(df.describe())

gives

             mpg        cyl        disp          hp       drat         wt  \
count  32.000000  32.000000   32.000000   32.000000  32.000000  32.000000   
mean   20.090625   6.187500  230.721875  146.687500   3.596563   3.217250   
std     6.026948   1.785922  123.938694   68.562868   0.534679   0.978457   
min    10.400000   4.000000   71.100000   52.000000   2.760000   1.513000   
25%    15.425000   4.000000  120.825000   96.500000   3.080000   2.581250   
50%    19.200000   6.000000  196.300000  123.000000   3.695000   3.325000   
75%    22.800000   8.000000  326.000000  180.000000   3.920000   3.610000   
max    33.900000   8.000000  472.000000  335.000000   4.930000   5.424000   

            qsec         vs         am       gear     carb  
count  32.000000  32.000000  32.000000  32.000000  32.0000  
mean   17.848750   0.437500   0.406250   3.687500   2.8125  
std     1.786943   0.504016   0.498991   0.737804   1.6152  
min    14.500000   0.000000   0.000000   3.000000   1.0000  
25%    16.892500   0.000000   0.000000   3.000000   2.0000  
50%    17.710000   0.000000   0.000000   4.000000   2.0000  
75%    18.900000   1.000000   1.000000   4.000000   4.0000  
max    22.900000   1.000000   1.000000   5.000000   8.0000

This is not good as vs is a binary variable which indicates if the car has a v-engine or a straight engine (source). Hence the feature is of nominal scale. Hence min / max / std / mean are not applicable. It should rather be counted how often 0 and 1 appear.

In R, you can do the following:

mtcars$vs = factor(mtcars$vs, levels=c(0, 1), labels=c("straight engine", "V-Engine"))
mtcars$am = factor(mtcars$am, levels=c(0, 1), labels=c("Automatic", "Manual"))
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)
summary(mtcars)

and get

      mpg             cyl             disp             hp             drat      
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930  
       wt             qsec                     vs             am     gear   carb  
 Min.   :1.513   Min.   :14.50   straight engine:18   Automatic:19   3:15   1: 7  
 1st Qu.:2.581   1st Qu.:16.89   V-Engine       :14   Manual   :13   4:12   2:10  
 Median :3.325   Median :17.71                                       5: 5   3: 3  
 Mean   :3.217   Mean   :17.85                                              4:10  
 3rd Qu.:3.610   3rd Qu.:18.90                                              6: 1  
 Max.   :5.424   Max.   :22.90                                              8: 1

Is something similar also possible with Pandas?

I tried

df["vs"] = df["vs"].astype('category')

But this makes "vs" disappear from the description.

471

asked Sep 07 '16 10:09

Martin Thoma

1 Answers

Late to the party, but I happen to have been struggling with some of the same issues lately, so I thought I would share my take on this challenge.

It seems to me that R is still better at handling categorical variables. There are however a few ways you could mimic some of this functionality using Python with pd.Categorical(), pd.GetDummies() and describe().

The challenge in this particular dataset is that the categorical variables have very different properties. For example am is 0 or 1 for automatic or manual gears, respectively. And gear is either 3, 4, or 5, but still most reasonably considered as a categorical rather than numerical value. So for am I would replace 0 and 1 with 'automatic' and 'categorical', but for gear I would apply pd.GetDummies() to get 0 or 1 for each category of gear in order to be able to easily count how many models that have, for example, 3 gears.

I've had a utility function lying around for some time that I improved a bit yesterday. It surely is not the most elegangt, but it should give you the same information as you got using the R snippet. Your final output table consists of columns with an unequal number of rows. Instead of making a similar table as a dataframe and filling it with NaNs, I split the information in two: One table for numerical values, and one for categoricial values so you end up with this:

                 count
Straight Engine     18
V engine            14
automatic           13
manual              19
cyl_4               11
cyl_6                7
cyl_8               14
gear_3              15
gear_4              12
gear_5               5
carb_1               7
carb_2              10
carb_3               3
carb_4              10
carb_6               1
carb_8               1
             mpg        disp          hp       drat         wt       qsec
count  32.000000   32.000000   32.000000  32.000000  32.000000  32.000000
mean   20.090625  230.721875  146.687500   3.596563   3.217250  17.848750
std     6.026948  123.938694   68.562868   0.534679   0.978457   1.786943
min    10.400000   71.100000   52.000000   2.760000   1.513000  14.500000
25%    15.425000  120.825000   96.500000   3.080000   2.581250  16.892500
50%    19.200000  196.300000  123.000000   3.695000   3.325000  17.710000
75%    22.800000  326.000000  180.000000   3.920000   3.610000  18.900000
max    33.900000  472.000000  335.000000   4.930000   5.424000  22.900000

Here's the whole procedure for an easy copy&paste:

# imports
import pandas as pd

# to easily access R datasets:
# pip install pydataset
from pydataset import data 

# Load dataset
df_mtcars = data('mtcars')


# The following variables: cat, dum, num and recoding
# are used in the function describeCat/df, dummies, recode, categorical) below

# Specify which variables are dummy variables [0 or 1], 
# ategorical [multiple categories] or numeric
cat = ['cyl', 'gear', 'carb']
dum = ['vs', 'am']
num = [c for c in list(df_mtcars) if c not in cat+dum]

# Also, define a dictionary that describes how some dummy variables should be recoded
# For example, in the series am, 0 is recoded as automatic and 1 as manual gears
recoding = {'am':['manual', 'automatic'], 'vs':['Straight Engine', 'V engine']}

# The function:
def describeCat(df, dummies, recode, categorical):
    """ Retrieves specified dummy and categorical variables
        from a pandas DataFrame and describes them (just count for now).

        Dummy variables [0 or 1] can be recoded to categorical variables
        by specifying a dictionary

    Keyword arguments:
    df -- pandas DataFrame
    dummies -- list of column names to specify dummy variables [0 or 1]
    recode -- dictionary to specify which and how dummyvariables should be recoded
    categorical -- list of columns names to specify catgorical variables

    """


    # Recode dummy variables
    recoded = []

    # DataFrame to store recoded variables
    df_recoded = pd.DataFrame()

    for dummy in dummies:
        if dummy in recode.keys():

            dummySeries = df[dummy].copy(deep = True).to_frame()
            dummySeries[dummy][dummySeries[dummy] == 0] = recode[dummy][0]
            dummySeries[dummy][dummySeries[dummy] == 1] = recode[dummy][1]
            recoded.append(pd.Categorical(dummySeries[dummy]).describe())  

            df_rec = pd.DataFrame(pd.Categorical(dummySeries[dummy]).describe())
            df_recoded = pd.concat([df_recoded.reset_index(),df_rec.reset_index()],
                                    ignore_index=True).set_index('categories')

    df_recoded = df_recoded['counts'].to_frame()

    # Rename columns and change datatype
    df_recoded['counts'] = df_recoded['counts'].astype(int)
    df_recoded.columns = ['count']


    # Since categorical variables will be transformed into dummy variables,
    # all remaining dummy variables (after recoding) can be treated the
    # same way as the categorical variables
    unrecoded = [var for var in dum if var not in recoding.keys()]
    categorical = categorical + unrecoded

    # Categorical split into dummy variables will have the same index
    # as the original dataframe
    allCats = pd.DataFrame(index = df.index)

    # apply pd.get_dummies on all categoirical variables
    for cat in categorical:
        newCats = pd.DataFrame(data = pd.get_dummies(pd.Categorical(df_mtcars[cat]), prefix = cat))
        newCats.index = df_mtcars.index
        allCats = pd.concat([allCats, newCats], axis = 1)
        df_cat = allCats.sum().to_frame()
    df_cat.columns = ['count']

    # gather output dataframes
    df_output = pd.concat([df_recoded, df_cat], axis = 0)


    return(df_output)

# Test run: Build a dataframe that describes the dummy and categorical variables
df_categorical = describeCat(df = df_mtcars, dummies = dum, recode = recoding, categorical = cat)

# describe numerical variables
df_numerical = df_mtcars[num].describe()

print(df_categorical)
print(df_numerical)

A sidenote about categorical variables and describe():

The reason why I'm using pd.Categorical() in the function above is that the output from describe() seems to be somewhat unstable. Sometimes df_mtcars['gear'].astype('category').describe() returns:

count    32.000000
mean      3.687500
std       0.737804
min       3.000000
25%       3.000000
50%       4.000000
75%       4.000000
max       5.000000
Name: gear, dtype: float64

While it should, given that it is considered a categorical variable, return:

count     32
unique     3
top        3
freq      15
Name: gear, dtype: int64

I may be wrong here, and I am having problems reproducing that issue, but I can swear this happens from time to time.

Using describe() on a pd.Categorical() gives an output of it's own format, but at least it seems to be stable.

            counts    freqs
categories                 
3               15  0.46875
4               12  0.37500
5                5  0.15625

And some last words about pd.get_dummies()

Here's what happens when you apply that function to df_mtcars['gear']:

# code
pd.get_dummies(df_mtcars['gear'].astype('category'), prefix = 'gear')

# output
                     gear_3  gear_4  gear_5
Mazda RX4                 0       1       0
Mazda RX4 Wag             0       1       0
Datsun 710                0       1       0
Hornet 4 Drive            1       0       0
Hornet Sportabout         1       0       0
Valiant                   1       0       0
.
.
.
Ferrari Dino              0       0       1
Maserati Bora             0       0       1
Volvo 142E                0       1       0

But in this case, I would simply use value_counts() so that you would get the following:

            counts    freqs
categories                 
3               15  0.46875
4               12  0.37500
5                5  0.15625

Which also happens to resemble the output from using describe() on a pd.Categorical() variable.

answered Oct 11 '22 14:10

vestland

Related questions
                            
                                What do low_memory and memory_map flags do in pd.read_csv
                            
                                pip install with shell completion
                            
                                Convert all relative imports to absolute automatically in python
                            
                                Proper way to cast numpy.matrix to C double pointer
                            
                                Why isn't locale.strxfrm("Gè") a prefix of locale.strxfrm("Gène")) with locale "fr_FR.UTF-8"?
                            
                                How to import two versions of the same python module at the same time?
                            
                                Is there a django admin widget for adding multiple foreign keys with an inline through_model
                            
                                How/where to store temp files and logs for a cloud app?
                            
                                Include list_route methods in Django REST framework's API root
                            
                                sqlalchemy CompileError Unconsumed column names when deleting row from m2m table
                            
                                Calculate how a value differs from the average of values using the Gaussian Kernel Density (Python)
                            
                                How can I show verbose py.test diffs without verbose test progress?
                            
                                PEP 0008: What does the BDFL mean by 'in true XP style'?
                            
                                How to randomly shuffle a list that has more permutations than the PRNG's period?
                            
                                Difference between list comprehension and generator comprehension with `yield` inside
                            
                                Computing symmetric Kullback-Leibler divergence between two documents
                            
                                How to connect django to docker redis container?
                            
                                OpenCV - specify format while writing image to file (cv2.imwrite)
                            
                                How to plot Pandas datetime series in Seaborn distplot?
                            
                                Difference between numpy ediff1d and diff

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How can I get a similar summary of a Pandas dataframe as in R?

Tags:

python

pandas

dataframe

r