Different scales allow different types of operations. I would like to specify the scale of a column in a dataframe df
. Then, df.describe()
should take this into account.
import pandas as pd
import pandas.rpy.common as rcom
df = rcom.load_data('mtcars')
print(df.describe())
gives
mpg cyl disp hp drat wt \
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000
mean 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
std 6.026948 1.785922 123.938694 68.562868 0.534679 0.978457
min 10.400000 4.000000 71.100000 52.000000 2.760000 1.513000
25% 15.425000 4.000000 120.825000 96.500000 3.080000 2.581250
50% 19.200000 6.000000 196.300000 123.000000 3.695000 3.325000
75% 22.800000 8.000000 326.000000 180.000000 3.920000 3.610000
max 33.900000 8.000000 472.000000 335.000000 4.930000 5.424000
qsec vs am gear carb
count 32.000000 32.000000 32.000000 32.000000 32.0000
mean 17.848750 0.437500 0.406250 3.687500 2.8125
std 1.786943 0.504016 0.498991 0.737804 1.6152
min 14.500000 0.000000 0.000000 3.000000 1.0000
25% 16.892500 0.000000 0.000000 3.000000 2.0000
50% 17.710000 0.000000 0.000000 4.000000 2.0000
75% 18.900000 1.000000 1.000000 4.000000 4.0000
max 22.900000 1.000000 1.000000 5.000000 8.0000
This is not good as vs
is a binary variable which indicates if the car has a v-engine or a straight engine (source). Hence the feature is of nominal scale. Hence min / max / std / mean are not applicable. It should rather be counted how often 0 and 1 appear.
In R, you can do the following:
mtcars$vs = factor(mtcars$vs, levels=c(0, 1), labels=c("straight engine", "V-Engine"))
mtcars$am = factor(mtcars$am, levels=c(0, 1), labels=c("Automatic", "Manual"))
mtcars$gear = factor(mtcars$gear)
mtcars$carb = factor(mtcars$carb)
summary(mtcars)
and get
mpg cyl disp hp drat
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 Min. :2.760
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 1st Qu.:3.080
Median :19.20 Median :6.000 Median :196.3 Median :123.0 Median :3.695
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 Mean :3.597
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 3rd Qu.:3.920
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 Max. :4.930
wt qsec vs am gear carb
Min. :1.513 Min. :14.50 straight engine:18 Automatic:19 3:15 1: 7
1st Qu.:2.581 1st Qu.:16.89 V-Engine :14 Manual :13 4:12 2:10
Median :3.325 Median :17.71 5: 5 3: 3
Mean :3.217 Mean :17.85 4:10
3rd Qu.:3.610 3rd Qu.:18.90 6: 1
Max. :5.424 Max. :22.90 8: 1
Is something similar also possible with Pandas?
I tried
df["vs"] = df["vs"].astype('category')
But this makes "vs"
disappear from the description.
The describe() function computes a summary of statistics pertaining to the DataFrame columns. This function gives the mean, std and IQR values. And, function excludes the character columns and given summary about numeric columns.
In the following program, we take a Data Frame in df with two numeric columns, and find its summary using summary() function. Output contains summary which is column wise. In the first column named “a”, minimum value is 41.0, maximum value is 43.0, mean 42.0, and so on.
Pandas for Python and Dplyr for R are the two most popular libraries for working with tabular/structured data for many Data Scientists.
A quick introduction to Pandas Describe The describe() method computes and displays summary statistics for a Python dataframe. (It also operates on dataframe columns and Pandas series objects.)
Late to the party, but I happen to have been struggling with some of the same issues lately, so I thought I would share my take on this challenge.
It seems to me that R is still better at handling categorical variables. There are however a few ways you could mimic some of this functionality using Python with pd.Categorical()
, pd.GetDummies()
and describe()
.
The challenge in this particular dataset is that the categorical variables have very different properties. For example am is 0 or 1
for automatic or manual gears, respectively. And gear is either 3, 4, or 5
, but still most reasonably considered as a categorical rather than numerical value. So for am
I would replace 0 and 1 with 'automatic' and 'categorical', but for gear I would apply pd.GetDummies()
to get 0 or 1 for each category of gear in order to be able to easily count how many models that have, for example, 3 gears.
I've had a utility function lying around for some time that I improved a bit yesterday. It surely is not the most elegangt, but it should give you the same information as you got using the R snippet. Your final output table consists of columns with an unequal number of rows. Instead of making a similar table as a dataframe and filling it with NaNs, I split the information in two: One table for numerical values, and one for categoricial values so you end up with this:
count
Straight Engine 18
V engine 14
automatic 13
manual 19
cyl_4 11
cyl_6 7
cyl_8 14
gear_3 15
gear_4 12
gear_5 5
carb_1 7
carb_2 10
carb_3 3
carb_4 10
carb_6 1
carb_8 1
mpg disp hp drat wt qsec
count 32.000000 32.000000 32.000000 32.000000 32.000000 32.000000
mean 20.090625 230.721875 146.687500 3.596563 3.217250 17.848750
std 6.026948 123.938694 68.562868 0.534679 0.978457 1.786943
min 10.400000 71.100000 52.000000 2.760000 1.513000 14.500000
25% 15.425000 120.825000 96.500000 3.080000 2.581250 16.892500
50% 19.200000 196.300000 123.000000 3.695000 3.325000 17.710000
75% 22.800000 326.000000 180.000000 3.920000 3.610000 18.900000
max 33.900000 472.000000 335.000000 4.930000 5.424000 22.900000
Here's the whole procedure for an easy copy&paste:
# imports
import pandas as pd
# to easily access R datasets:
# pip install pydataset
from pydataset import data
# Load dataset
df_mtcars = data('mtcars')
# The following variables: cat, dum, num and recoding
# are used in the function describeCat/df, dummies, recode, categorical) below
# Specify which variables are dummy variables [0 or 1],
# ategorical [multiple categories] or numeric
cat = ['cyl', 'gear', 'carb']
dum = ['vs', 'am']
num = [c for c in list(df_mtcars) if c not in cat+dum]
# Also, define a dictionary that describes how some dummy variables should be recoded
# For example, in the series am, 0 is recoded as automatic and 1 as manual gears
recoding = {'am':['manual', 'automatic'], 'vs':['Straight Engine', 'V engine']}
# The function:
def describeCat(df, dummies, recode, categorical):
""" Retrieves specified dummy and categorical variables
from a pandas DataFrame and describes them (just count for now).
Dummy variables [0 or 1] can be recoded to categorical variables
by specifying a dictionary
Keyword arguments:
df -- pandas DataFrame
dummies -- list of column names to specify dummy variables [0 or 1]
recode -- dictionary to specify which and how dummyvariables should be recoded
categorical -- list of columns names to specify catgorical variables
"""
# Recode dummy variables
recoded = []
# DataFrame to store recoded variables
df_recoded = pd.DataFrame()
for dummy in dummies:
if dummy in recode.keys():
dummySeries = df[dummy].copy(deep = True).to_frame()
dummySeries[dummy][dummySeries[dummy] == 0] = recode[dummy][0]
dummySeries[dummy][dummySeries[dummy] == 1] = recode[dummy][1]
recoded.append(pd.Categorical(dummySeries[dummy]).describe())
df_rec = pd.DataFrame(pd.Categorical(dummySeries[dummy]).describe())
df_recoded = pd.concat([df_recoded.reset_index(),df_rec.reset_index()],
ignore_index=True).set_index('categories')
df_recoded = df_recoded['counts'].to_frame()
# Rename columns and change datatype
df_recoded['counts'] = df_recoded['counts'].astype(int)
df_recoded.columns = ['count']
# Since categorical variables will be transformed into dummy variables,
# all remaining dummy variables (after recoding) can be treated the
# same way as the categorical variables
unrecoded = [var for var in dum if var not in recoding.keys()]
categorical = categorical + unrecoded
# Categorical split into dummy variables will have the same index
# as the original dataframe
allCats = pd.DataFrame(index = df.index)
# apply pd.get_dummies on all categoirical variables
for cat in categorical:
newCats = pd.DataFrame(data = pd.get_dummies(pd.Categorical(df_mtcars[cat]), prefix = cat))
newCats.index = df_mtcars.index
allCats = pd.concat([allCats, newCats], axis = 1)
df_cat = allCats.sum().to_frame()
df_cat.columns = ['count']
# gather output dataframes
df_output = pd.concat([df_recoded, df_cat], axis = 0)
return(df_output)
# Test run: Build a dataframe that describes the dummy and categorical variables
df_categorical = describeCat(df = df_mtcars, dummies = dum, recode = recoding, categorical = cat)
# describe numerical variables
df_numerical = df_mtcars[num].describe()
print(df_categorical)
print(df_numerical)
A sidenote about categorical variables and describe():
The reason why I'm using pd.Categorical()
in the function above is that the output from describe()
seems to be somewhat unstable. Sometimes df_mtcars['gear'].astype('category').describe()
returns:
count 32.000000
mean 3.687500
std 0.737804
min 3.000000
25% 3.000000
50% 4.000000
75% 4.000000
max 5.000000
Name: gear, dtype: float64
While it should, given that it is considered a categorical variable, return:
count 32
unique 3
top 3
freq 15
Name: gear, dtype: int64
I may be wrong here, and I am having problems reproducing that issue, but I can swear this happens from time to time.
Using describe()
on a pd.Categorical()
gives an output of it's own format, but at least it seems to be stable.
counts freqs
categories
3 15 0.46875
4 12 0.37500
5 5 0.15625
And some last words about pd.get_dummies()
Here's what happens when you apply that function to df_mtcars['gear']
:
# code
pd.get_dummies(df_mtcars['gear'].astype('category'), prefix = 'gear')
# output
gear_3 gear_4 gear_5
Mazda RX4 0 1 0
Mazda RX4 Wag 0 1 0
Datsun 710 0 1 0
Hornet 4 Drive 1 0 0
Hornet Sportabout 1 0 0
Valiant 1 0 0
.
.
.
Ferrari Dino 0 0 1
Maserati Bora 0 0 1
Volvo 142E 0 1 0
But in this case, I would simply use value_counts()
so that you would get the following:
counts freqs
categories
3 15 0.46875
4 12 0.37500
5 5 0.15625
Which also happens to resemble the output from using describe()
on a pd.Categorical()
variable.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With