How to generate legible plots in pandas when looping over columns?

Question

Generate the dataframe for reproducibility:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(50, 1000), columns=list('ABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDED'))

Check the normality of the distribution of each variable (note: this takes a long time to run):

# Set the column names

columns= df.columns

# Loop over all columns

fig, axs = plt.subplots(len(df.columns), figsize=(5, 25))
for n, col in enumerate(df.columns):
    df[col].hist(ax=axs[n])

The result is a set of illegible histograms, and the code takes a very long time to run.

The run time is acceptable, but I am curious whether anyone has suggestions for generating legible histograms (they do not have to be fancy) that can be reviewed quickly for the entire dataframe to check the normality of the distributions.

Nathaniel · Accepted Answer

This code generates 1000 histograms and allows you to see each one in sufficient detail to understand how normally-distributed the columns are:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

cols = 1000
df = pd.DataFrame(np.random.normal(0, 1, [50, cols]))

# Draw every column in its own cell of a 25 x 40 grid (25 * 40 = 1000)
fig = plt.figure(figsize=(16, 10))
for n, col in enumerate(df.columns):
    ax = fig.add_subplot(25, 40, n + 1)
    df[col].hist(ax=ax)
    ax.axis('off')  # hide ticks and frames so only the shapes remain
fig.tight_layout()

fig.savefig('1000_histograms.png', bbox_inches='tight', pad_inches=0, dpi=200)

1000 histograms
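
The 25 x 40 grid is hard-coded for exactly 1000 columns. For a different column count, the grid dimensions could be derived instead. The following is only a minimal sketch and not part of the original answer; `grid_shape` and its `aspect` parameter are hypothetical names chosen for illustration.

```python
import math

def grid_shape(n_plots, aspect=1.6):
    """Return (rows, cols) with rows * cols >= n_plots and cols / rows near aspect."""
    if n_plots < 1:
        raise ValueError("n_plots must be positive")
    # Solve rows * cols = n_plots with cols = aspect * rows, then round up
    cols = max(1, math.ceil(math.sqrt(n_plots * aspect)))
    rows = math.ceil(n_plots / cols)
    return rows, cols

rows, cols = grid_shape(1000)
print(rows, cols)  # 25 40
```

The returned pair can then be passed to `plt.subplot(rows, cols, n + 1)` inside the loop.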

Another way to ascertain normality is with a QQ plot, which may be easier to visualize in bulk compared to a histogram:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

cols = 1000
df = pd.DataFrame(np.random.normal(0, 1, [50, cols]))

fig = plt.figure(figsize=(18, 12))
for n, col in enumerate(df.columns):
    ax = fig.add_subplot(25, 40, n + 1)
    # Optionally pass line='45' to draw the reference diagonal in each cell
    sm.qqplot(df[col], ax=ax,
              marker='.', markerfacecolor='C0', markeredgecolor='C0',
              markersize=2)
    ax.axis('off')

fig.savefig('1000_QQ_plots13.png', bbox_inches='tight', pad_inches=0, dpi=200)

1000 QQ plots

The closer the points of each plot lie to a 45-degree diagonal, the closer the column data is to a normal distribution.
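
If scanning 1000 QQ plots by eye is still too much, the straightness of each QQ plot can be summarized as a single number. The sketch below is an addition, not part of the original answer: it uses `scipy.stats.probplot`, which fits a least-squares line to the QQ points and returns the correlation coefficient `r`. Note that `r` measures how straight the points are, not whether they lie exactly on the 45-degree line. The helper name `qq_correlations` and the sample data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

def qq_correlations(df):
    """Return the probability-plot correlation r for each column, by position."""
    r_values = []
    for i in range(df.shape[1]):
        # probplot returns ((theoretical, ordered), (slope, intercept, r))
        _, (_slope, _intercept, r) = stats.probplot(df.iloc[:, i], dist="norm")
        r_values.append(r)
    return pd.Series(r_values, name="qq_r")

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "normal": rng.normal(0, 1, 500),
    "skewed": rng.exponential(1.0, 500),
})
r = qq_correlations(df)
print(r.sort_values())  # lowest r first: the least normal-looking columns
```

Sorting by `r` would let you plot only the handful of columns with the lowest values.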

LoneWanderer · Answer

Plotting vs normality test
Proposition
Output example
Corresponding code sample

Plotting vs normality test

As discussed in the comments below, the OP's question has changed to managing thousands of plots. From that perspective, Nathaniel's answer is appropriate.

However, I felt that the unstated intent was to decide whether a given variable is normally distributed or not, with thousands of variables to consider:

"Check for normality of distribution of each variable (note: this takes a long time to run)"

With that in mind, it appears (to me) that having a human review thousands of plots to spot normal/non-normal distributions is an inappropriate method. There is a French idiom for this kind of overly complicated contraption: "usine à gaz" ("gas factory").

Therefore, this answer focuses on performing the analysis programmatically and providing a more concise report.

Proposition

Perform analysis of data normality over a huge number of columns. It relies on the suggestion expressed in this answer.

The idea is to:

Perform a distribution (normality) test on every column.
Collect the results into a dataframe.
Report the normal/non-normal ratio in a graph.
Report the names of the non-normal columns.

With this method, we can go on to manipulate the normal/non-normal columns programmatically. For instance, we could perform additional distribution tests, or plot only the non-normal distributions, thus reducing the number of graphs that actually need to be looked at.
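
As a rough illustration of narrowing down what needs to be plotted, the sketch below tests all columns in one call and keeps only the flagged ones. It assumes `scipy.stats.normaltest` applied column-wise with `axis=0`; the helper name `flagged_positions` and the sample data are hypothetical. Columns are selected by position so that duplicate labels do not matter.

```python
import numpy as np
import pandas as pd
from scipy import stats

def flagged_positions(df, alpha=0.01):
    """Return positional indices of columns whose normality test gives p < alpha."""
    # normaltest accepts a 2-D array and tests each column when axis=0
    _, p_values = stats.normaltest(df.to_numpy(), axis=0)
    return np.flatnonzero(p_values < alpha)

rng = np.random.default_rng(42)
df = pd.DataFrame(
    np.column_stack([rng.normal(size=(2000, 3)), rng.exponential(size=(2000, 2))]),
    columns=list("ABCYZ"),
)
positions = flagged_positions(df)
to_plot = df.iloc[:, positions]  # only these columns need a visual check
print(list(to_plot.columns))
```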

Output example:

------------
Columns probably not a normal dist:
  Column  Not_Normal  p-value   Normality
0      V        True      0.0  Not Normal
0      W        True      0.0  Not Normal
0      X        True      0.0  Not Normal
0      Y        True      0.0  Not Normal
0      Z        True      0.0  Not Normal


Disclaimer: the methods used may not be statistically "canonical". One should be very careful when using statistical tools, since each one has its own usage domain/use case.

I chose a significance level of 0.01 (1%), since it could become the standard threshold in scientific publications instead of the usual 0.05 (5%).
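
One consequence of testing many columns with a fixed threshold is worth spelling out with simple arithmetic. The sketch below is an addition to the answer: the column count of 1000 comes from the question, and the Bonferroni correction is shown only as one common, conservative option, not as something the original answer uses.

```python
n_columns = 1000   # number of columns tested, as in the question
alpha = 0.01       # per-test significance level

# Even if every column were truly normal, each test rejects with probability alpha
expected_false_alarms = n_columns * alpha

# Bonferroni correction: divide the threshold by the number of tests
bonferroni_alpha = alpha / n_columns

print(expected_false_alarms, bonferroni_alpha)
```

In other words, around ten columns would be expected to be flagged by chance alone at the 1% level.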

One should read https://en.wikipedia.org/wiki/Normality_test

Tests of univariate normality include the following:

D'Agostino's K-squared test,
Jarque–Bera test,
Anderson–Darling test,
Cramér–von Mises criterion,
Lilliefors test,
Kolmogorov–Smirnov test,
Shapiro–Wilk test, and
Pearson's chi-squared test.
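
Several of these tests are available in scipy and can be compared on the same data. The following is a minimal sketch, assuming `scipy.stats.normaltest` (D'Agostino's K-squared), `scipy.stats.jarque_bera`, and `scipy.stats.shapiro`; the sample data and dictionary keys are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
samples = {
    "normal": rng.normal(0, 1, 500),
    "skewed": rng.exponential(1.0, 500),
}

p_values = {}
for name, x in samples.items():
    # Each test returns (statistic, p-value); only the p-value is kept here
    _, p_k2 = stats.normaltest(x)
    _, p_jb = stats.jarque_bera(x)
    _, p_sw = stats.shapiro(x)
    p_values[name] = {"dagostino_k2": p_k2, "jarque_bera": p_jb, "shapiro_wilk": p_sw}
    print(name, p_values[name])
```

The tests differ in which departures from normality they are sensitive to, so they will not always agree on borderline columns.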

Code

Behavior may vary on your computer depending on the random number generation (RNG). The following example uses five columns sampled from a normal distribution and five sampled from a Pareto distribution, both generated with numpy. The normality test performs well under these conditions (even if p-values of exactly 0.0 look suspicious to me, even for Pareto samples). Nevertheless, I think we can agree that the point is the method, not the actual results.

import pandas as pd
import numpy as np
import scipy
from scipy import stats
import seaborn as sb
import matplotlib.pyplot as plt
import sys

print('System: {}'.format(sys.version))
for module in [pd, np, scipy, sb]:
    print('Module {:10s} - version {}'.format(module.__name__, module.__version__))

nb_lines = 10000
headers_normal = 'ABCDE'
headers_pareto = 'VWXYZ'
repeat_factor = 1
nb_cols = len(list(repeat_factor * headers_normal))

df_normal = pd.DataFrame(np.random.randn(nb_lines, nb_cols), columns=list(repeat_factor * headers_normal))

df_pareto = pd.DataFrame((np.random.pareto(12.0, size=(nb_lines, nb_cols)) + 15.) * 4., columns=list(repeat_factor * headers_pareto))

df = df_normal.join(df_pareto)

alpha = 0.01
df_list = list()

# normality code taken from https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html
cat_map = {True: 'Not Normal',
           False: 'Maybe Normal'}
for col in df.columns:
    k2, p = stats.normaltest(df[col])
    is_not_normal = p < alpha
    tmp_df = pd.DataFrame({'Column': [col],
                           'Not_Normal': [is_not_normal],
                           'p-value': [p],
                           'Normality': cat_map[is_not_normal]
                           })
    df_list.append(tmp_df)

df_results = pd.concat(df_list)
df_results['Normality'] = df_results['Normality'].astype('category')

print('------------')
print('Columns names probably not a normal dist:')
# full data
print(df_results[(df_results['Normality'] == 'Not Normal')])
# only column names
# print(df_results[(df_results['Normality'] == 'Not Normal')]['Column'])
print('------------')
print('Plotting countplot')
# Passing y= already makes the bars horizontal, so no orient argument is needed
sb.countplot(data=df_results, y='Normality')
plt.show()

Outputs:

System: 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
Module pandas     - version 0.24.1
Module numpy      - version 1.16.2
Module scipy      - version 1.2.1
Module seaborn    - version 0.9.0
------------
Columns names probably not a normal dist:
  Column  Not_Normal  p-value   Normality
0      V        True      0.0  Not Normal
0      W        True      0.0  Not Normal
0      X        True      0.0  Not Normal
0      Y        True      0.0  Not Normal
0      Z        True      0.0  Not Normal
------------
Plotting countplot

CAPSLOCK · Answer

I really like Nathaniel's answer but I will add my two cents.

I would go for seaborn and in particular seaborn.distplot. This will allow you to easily fit a normal distribution to each histogram plot and make the visualization easier.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

cols = 1000
df = pd.DataFrame(np.random.normal(0, 1, [50, cols]))

fig = plt.figure(figsize=(16, 10))
for i, col in enumerate(df.columns):
    # 25 x 40 grid: one cell per column (25 * 40 = 1000)
    ax = fig.add_subplot(25, 40, i + 1)
    # Note: distplot is deprecated since seaborn 0.11
    sns.distplot(df[col], fit=norm, kde=False, ax=ax)
    ax.axis('off')  # hide ticks so the small plots stay readable
plt.tight_layout()
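
Since `seaborn.distplot` has been deprecated since seaborn 0.11, the same kind of overlay could also be drawn with matplotlib and scipy directly. The following is only a sketch under that assumption, not the answerer's method; `hist_with_normal_fit` is a hypothetical helper name.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

def hist_with_normal_fit(ax, values, bins=10):
    """Draw a density histogram and overlay the maximum-likelihood normal curve."""
    values = np.asarray(values, dtype=float)
    mu, sigma = norm.fit(values)  # maximum-likelihood mean and standard deviation
    ax.hist(values, bins=bins, density=True)
    xs = np.linspace(values.min(), values.max(), 100)
    ax.plot(xs, norm.pdf(xs, mu, sigma), color="k", linewidth=1)
    ax.axis("off")
    return mu, sigma

rng = np.random.default_rng(0)
fig, ax = plt.subplots(figsize=(3, 2))
mu, sigma = hist_with_normal_fit(ax, rng.normal(0, 1, 50))
print(mu, sigma)
```

Inside the loop over columns, the helper would be called once per subplot axis in place of `sns.distplot`.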

Additionally, I am not sure whether putting columns with the same name in your example was done on purpose. If it was, the easiest way to loop through the columns is to use .iloc, and the code would look like this:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

# Assumes the label string from the question is 'ABCDABCDED' repeated 100 times (1000 labels)
df = pd.DataFrame(np.random.randn(50, 1000), columns=list('ABCDABCDED' * 100))

fig = plt.figure(figsize=(12, 10))
for i in range(df.shape[1]):
    ax = fig.add_subplot(25, 40, i + 1)
    # Select by position: df[col] would return every column sharing that label
    sns.distplot(df.iloc[:, i], fit=norm, kde=False, ax=ax)
    ax.axis('off')
plt.tight_layout()
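
To see why positional indexing matters when labels repeat, here is a small illustration with a made-up three-column frame. It is an added sketch, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 3)), columns=list("ABA"))  # label 'A' appears twice

by_label = df["A"]            # selects every column labelled 'A'
by_position = df.iloc[:, 0]   # selects exactly one column

print(type(by_label).__name__, by_label.shape)        # DataFrame (3, 2)
print(type(by_position).__name__, by_position.shape)  # Series (3,)
```

A per-column plotting call expects a single Series, which is what `.iloc` guarantees here.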


Tags: python, python-3.x, pandas

arkadiy
