Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Descriptive statistics in Python /with Pandas with std in parentheses

This question concerns the best-practice to do descriptive statistics in Python with a formatted output that correspond to tables found in academic publications: means with their respective standard deviations in parenthesis below. Final goal is to be able to export it in a Latex tabular format (or an other format, html, etc).

Example (Deucherta & Eugster (2018)):

enter image description here

Pandas:

The classical solution to do descriptive statistics in Pandas is to use the describe() method of a DataFrame.

import numpy as np
import pandas as pd 

# Generate a DataFrame to have an example 
df = pd.DataFrame(
    {"Age" : np.random.normal(20,15,5), 
     "Income": np.random.pareto(1,5)*20_000 }
    )
# The describe method to get means and stds
df.describe().loc[["mean", "std"]].T
>>>
                mean            std
Age        15.322797      13.449727
Income  97755.733510  143683.686484

What I would like to have is the following output:

Age        15.32 
          (13.44)
Income  97755.73  
        (143683.68)

It would be nice to have a solution that works with Multi-index Dataframe:

df2 = pd.DataFrame(
    {"Age" : np.random.normal(20,15,5), 
     "Income": np.random.pareto(1,5)*20_000 }
    )
df_c = pd.concat([df,df2], keys = ["A", "B"])
>>>

and get

                A           B
Age          23.15       21.33
            (11.62)      (9.34)
Income    68415.53    46619.51
         (95612.40)  (64596.10)

My current solution:

idx = pd.IndexSlice
df_desc = (df_c
      ).groupby(level = 0, axis = 0).describe()
df_desc = df_desc.loc[idx[:],idx[:,["mean", "std"]]].T
df_desc.loc[idx[:,["std"]],idx[:]] = df_desc.loc[idx[:,["std"]],idx[:]
                                               ].applymap(
                                               lambda x: "("+"{:.2f}".format(x)+")")
print(df_desc)

>>>
                     A           B
Age    mean     23.1565     21.3359
      std      (11.62)      (9.34)
Income mean     68415.5     46619.5
      std   (95612.40)  (64596.10)
Problem 1 :

I did not find the solution to hide the second index column [mean, std, mean,std].

Then I want to export my df to latex:

df_desc.to_latex()

>>>
\begin{tabular}{llll}
\toprule
       &     &            A &           B \\
\midrule
Age & mean &       5.5905 &     29.5894 \\
       & std &      (16.41) &     (13.03) \\
Income & mean &       531970 &     72653.7 \\
       & std &  (875272.44) &  (79690.18) \\
\bottomrule
\end{tabular}
Problem 2:

The & characters of the table are not aligned which makes it a bit tedious to edit (I use extensions for aligning & in VSCode)

Overall I find this solution tedious and not elegant.

Solution(s) ?

I do not know what I should do to obtain the desired result without complex string manipulation.

I have looked at Pandas styling, but I don't think that it is the best solution.

There is also StatModels Tables, but I did not find a simple solution to my problem. Statsmodels Tables seems the most promising solution. But I do not know how to implement it. There are some descriptive stats functions in StatsModels, but I read on GitHub that they are to some extent deprecated.

So what is the best way to make those tables?

like image 652
Adrien Pacifico Avatar asked Jan 31 '19 17:01

Adrien Pacifico


1 Answers

I just ran into a similar problem and found your post, so here's how I dealt with the issues you mentioned.

Problem 1: Hide second index column

I prefer solution b), but leave a) here for illustrative purposes.

a) droplevel & set_index

df_desc.index.droplevel(level=1)

>>>
Index(['Age', 'Age', 'Income', 'Income'], dtype='object')

Use this piece of code along with a set_index expression:

df_desc.set_index(df_desc.index.droplevel(level=1), inplace=True)

This results in:

print(df_desc)

>>>
                  A           B
Age         17.1527     23.9678
Age         (19.73)     (12.01)
Income       293271     12178.8
Income  (400059.27)  (14483.35)

which doesn't look satisfying yet, as the index values Age and Income appear twice, respectively.

That's why I came up with the following.

b) Create new DataFrame using pd.DataFrame.values and assign index manually

First, reset index:

df_desc = df_desc.reset_index(drop=True)

print(df_desc)

>>>
            A           B
0      17.306      11.425
1     (14.40)     (16.67)
2     88016.7     67280.4
3  (73054.44)  (54953.69)

Second, create new DataFrame specifying the index and column names manually. Note that I used df_desc.values as the data argument (first position).

df_new = pd.DataFrame(df_desc.values, index=["Age", "", "Income", ""], columns=["A", "B"])

print(df_new)

>>>
                 A           B
Age        27.7039     20.8031
           (13.99)     (13.92)
Income     20690.7     7370.44
        (29470.03)  (13279.10)

Problem 2: Align LaTeX table

Note that running

df_new.to_latex()

indeed produces a somewhat messy str output:

>>> 
'\\begin{tabular}{lll}\n\\toprule\n{} &           A &           B \\\\\n\\midrule\nAge    &     27.7039 &     20.8031 \\\\\n       &     (13.99) &     (13.92) \\\\\nIncome &     20690.7 &     7370.44 \\\\\n       &  (29470.03) &  (13279.10) \\\\\n\\bottomrule\n\\end{tabular}\n'

However, wrapping it inside a print statement produces the desired output:

print(df_new.to_latex())

>>>
\begin{tabular}{lll}
\toprule
{} &           A &           B \\
\midrule
Age    &     27.7039 &     20.8031 \\
       &     (13.99) &     (13.92) \\
Income &     20690.7 &     7370.44 \\
       &  (29470.03) &  (13279.10) \\
\bottomrule
\end{tabular}

Moreover, exporting the table to a LaTeX document is fairly simple.

As you noted yourself, to_latex() already creates a tabular, so you just need to write that to a file, and use \input in your LaTeX document. Following the example here, do the following:

i) Save the table as a text file

with open('mytable.tex','w') as tf:
    tf.write(df_new.to_latex())

ii) Use the exported table in a LaTeX document

\documentclass{article}
\usepackage{booktabs}
\begin{document}
\input{mytable}
\end{document}

This example here assumes that mytable.tex and the LaTeX document are in the same folder. The booktabs package is required, since to_latex() uses the booktabs commands for table rules.

The final pdf output looks like this:

enter image description here

like image 109
merowinger Avatar answered Sep 30 '22 16:09

merowinger