
How to generate legible plots in pandas when looping over columns?

Generate the dataframe for reproducibility:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(50, 1000), columns=list('ABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDED'))

Check for normalcy of distribution of each variable (note: this takes a long time to run)

# Set the column names

columns= df.columns

# Loop over all columns

fig, axs = plt.subplots(len(df.columns), figsize=(5, 25))
for n, col in enumerate(df.columns):
    df[col].hist(ax=axs[n])

Result generates illegible histograms and takes a very long time to run.

The length of time is okay, but I am curious if anyone has suggestions for generating legible histograms (do not have to be fancy), which can be quickly reviewed for the entire dataframe to ensure the normality of the distributions.

arkadiy asked Mar 14 '19 23:03


3 Answers

This code generates 1000 histograms and allows you to see each one in sufficient detail to understand how normally-distributed the columns are:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

cols = 1000
df = pd.DataFrame(np.random.normal(0, 1, [50, cols]))

# Draw one small histogram per column on a 25 x 40 grid (25 * 40 = 1000)
fig = plt.figure(figsize=(16, 10))
for n, col in enumerate(df.columns):
    ax = fig.add_subplot(25, 40, n + 1)
    df[col].hist(ax=ax)
    ax.axis('off')  # hide ticks and frames so only the shapes remain
fig.tight_layout()

fig.savefig('1000_histograms.png', bbox_inches='tight', pad_inches=0, dpi=200)

[Image: 1000 histograms]

Another way to ascertain normality is with a QQ plot, which may be easier to visualize in bulk compared to a histogram:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

cols = 1000
df = pd.DataFrame(np.random.normal(0, 1, [50, cols]))

# One small QQ plot per column on a 25 x 40 grid
fig = plt.figure(figsize=(18, 12))
for n, col in enumerate(df.columns):
    ax = fig.add_subplot(25, 40, n + 1)
    sm.qqplot(df[col], ax=ax,  # line='45',
              marker='.', markerfacecolor='C0', markeredgecolor='C0',
              markersize=2)
    # sm.qqline(ax=ax, line='45', fmt='lightgray')
    ax.axis('off')

fig.savefig('1000_QQ_plots13.png', bbox_inches='tight', pad_inches=0, dpi=200)

[Image: 1000 QQ plots]

The closer the points in each plot lie to a 45-degree diagonal, the more normally distributed the column data is.
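If eyeballing 1000 diagonals is still too much, the "how straight is the QQ line" idea can be turned into one number per column. The following is only a minimal sketch of that idea, not part of the approach above: it assumes scipy is available and uses scipy.stats.probplot, whose fit result includes the correlation coefficient r between the ordered sample and the theoretical normal quantiles. The data, the column names and the helper qq_correlation are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative data (an assumption, not the asker's dataset):
# 20 normal columns plus 5 clearly skewed exponential columns.
rng = np.random.default_rng(0)
data = {f"normal_{k}": rng.normal(0, 1, size=200) for k in range(20)}
data.update({f"skewed_{k}": rng.exponential(1.0, size=200) for k in range(5)})
df = pd.DataFrame(data)


def qq_correlation(values):
    """Hypothetical helper: correlation between sample and normal quantiles."""
    # probplot returns ((theoretical, ordered), (slope, intercept, r))
    _, (_, _, r) = stats.probplot(values, dist="norm")
    return r


# One r value per column; iterating by position also works when
# column names are duplicated.
qq_r = pd.Series(
    [qq_correlation(df.iloc[:, i].to_numpy()) for i in range(df.shape[1])],
    index=df.columns,
)

# Columns whose QQ plot bends the most come first.
print(qq_r.sort_values().head(5))
```

Values of r close to 1 correspond to nearly straight QQ plots, so sorting in ascending order puts the columns worth a closer look at the top.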

Nathaniel answered Oct 18 '22 16:10


  1. Plotting vs normality test
  2. Proposition
  3. Output example
  4. Corresponding code sample

Plotting vs normality test

As discussed in the comments below, the OP's question has shifted towards managing thousands of plots. From that perspective, Nathaniel's answer is appropriate.

However, I felt that the unstated intent was to decide whether a given variable is normally distributed or not, with thousands of variables to consider.

Check for normalcy of distribution of each variable (note: this takes a long time to run)

With that in mind, it appears (to me) that having a human review thousands of plots to spot normal/non-normal distributions is an inappropriate method. There is a French idiom for this: "usine à gaz" (literally "gas factory", meaning a needlessly complicated contraption).

Therefore, this answer focuses on performing the analysis programmatically and providing a more concise report.

Proposition

Perform analysis of data normality over a huge number of columns. It relies on the suggestion expressed in this answer.

The idea is to:

  • Perform a distribution test (normality) on every column.
  • Collect the results in a dataframe.
  • Report the normal/non-normal ratio in a graph.
  • Report the non-normal column names.

With this method, we can further use programming to manipulate the normal/non-normal columns. For instance, we could perform additional distribution tests, or plot only the non-normal distributions, thus reducing the number of graphs to actually observe.
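As a rough illustration of the "plot only the non-normal distributions" idea, here is a minimal sketch. It assumes scipy.stats.normaltest as the test and a 0.01 threshold; the synthetic data, the column names and the 4-column grid are arbitrary choices made for this example only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

# Illustrative data (an assumption): 8 normal and 4 skewed columns.
rng = np.random.default_rng(1)
data = {f"normal_{k}": rng.normal(size=500) for k in range(8)}
data.update({f"skewed_{k}": rng.exponential(size=500) for k in range(4)})
df = pd.DataFrame(data)

alpha = 0.01
# One p-value per column from D'Agostino's K-squared test.
p_values = df.apply(lambda s: stats.normaltest(s.to_numpy()).pvalue)
flagged = p_values[p_values < alpha].index.tolist()

# Plot only the flagged columns, a few per row.
n_cols = 4
n_rows = max(1, -(-len(flagged) // n_cols))  # ceiling division
fig, axs = plt.subplots(n_rows, n_cols,
                        figsize=(3 * n_cols, 2.5 * n_rows), squeeze=False)
for ax, name in zip(axs.ravel(), flagged):
    ax.hist(df[name], bins=30)
    ax.set_title(f"{name} (p={p_values[name]:.1e})", fontsize=8)
for ax in axs.ravel()[len(flagged):]:
    ax.axis("off")  # hide unused grid cells
fig.tight_layout()
# Call fig.savefig(...) or plt.show() here to inspect the result.
```

With real data, the number of figures to review then scales with the number of suspicious columns rather than with the width of the dataframe.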

Output example:

------------
Columns probably not a normal dist:
  Column  Not_Normal  p-value   Normality
0      V        True      0.0  Not Normal
0      W        True      0.0  Not Normal
0      X        True      0.0  Not Normal
0      Y        True      0.0  Not Normal
0      Z        True      0.0  Not Normal


Disclaimer: the methods used may not be statistically "canonical". One should be very careful when using statistical tools, since each one has its own specific domain of use.

I chose a significance threshold of 0.01 (1%) for the p-value, since it could become the standard value in scientific publications instead of the usual 0.05 (5%).

One should read https://en.wikipedia.org/wiki/Normality_test

Tests of univariate normality include the following:

  • D'Agostino's K-squared test,
  • Jarque–Bera test,
  • Anderson–Darling test,
  • Cramér–von Mises criterion,
  • Lilliefors test,
  • Kolmogorov–Smirnov test
  • Shapiro–Wilk test, and
  • Pearson's chi-squared test.

Code

Behavior may vary on your computer depending on the RNG (random number generation). The following example uses 5 columns sampled from a normal distribution and 5 columns sampled from a Pareto distribution, all generated with numpy. The normality test performs well in these conditions (even if I find the 0.0 p-values suspicious, even for Pareto samples). Nevertheless, I think we can agree that this is about the method, not the actual results.

import pandas as pd
import numpy as np
import scipy
from scipy import stats
import seaborn as sb
import matplotlib.pyplot as plt
import sys

print('System: {}'.format(sys.version))
for module in [pd, np, scipy, sb]:
    print('Module {:10s} - version {}'.format(module.__name__, module.__version__))

nb_lines = 10000
headers_normal = 'ABCDE'
headers_pareto = 'VWXYZ'
repeat_factor = 1
nb_cols = len(list(repeat_factor * headers_normal))

df_normal = pd.DataFrame(np.random.randn(nb_lines, nb_cols), columns=list(repeat_factor * headers_normal))

df_pareto = pd.DataFrame((np.random.pareto(12.0, size=(nb_lines,nb_cols )) + 15.) * 4., columns=list(repeat_factor * headers_pareto))

df = df_normal.join(df_pareto)

alpha = 0.01
df_list = list()

# normality code taken from https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html
cat_map = {True: 'Not Normal',
           False: 'Maybe Normal'}
for col in df.columns:
    k2, p = stats.normaltest(df[col])
    is_not_normal = p < alpha
    tmp_df = pd.DataFrame({'Column': [col],
                           'Not_Normal': [is_not_normal],
                           'p-value': [p],
                           'Normality': cat_map[is_not_normal]
                           })
    df_list.append(tmp_df)

df_results = pd.concat(df_list)
df_results['Normality'] = df_results['Normality'].astype('category')

print('------------')
print('Columns names probably not a normal dist:')
# full data
print(df_results[(df_results['Normality'] == 'Not Normal')])
# only column names
# print(df_results[(df_results['Normality'] == 'Not Normal')]['Column'])
print('------------')
print('Plotting countplot')
sb.countplot(data=df_results, y='Normality', orient='v')
plt.show()

Outputs:

System: 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
Module pandas     - version 0.24.1
Module numpy      - version 1.16.2
Module scipy      - version 1.2.1
Module seaborn    - version 0.9.0
------------
Columns names probably not a normal dist:
  Column  Not_Normal  p-value   Normality
0      V        True      0.0  Not Normal
0      W        True      0.0  Not Normal
0      X        True      0.0  Not Normal
0      Y        True      0.0  Not Normal
0      Z        True      0.0  Not Normal
------------
Plotting countplot
LoneWanderer answered Oct 18 '22 16:10


I really like Nathaniel's answer but I will add my two cents.

I would go for seaborn and in particular seaborn.distplot. This will allow you to easily fit a normal distribution to each histogram plot and make the visualization easier.

import seaborn as sns
from scipy.stats import norm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

cols = 1000
df = pd.DataFrame(np.random.normal(0, 1, [50, cols]))

# Note: sns.distplot is deprecated since seaborn 0.11
fig = plt.figure(figsize=(16, 10))
for i, col in enumerate(df.columns):
    # 25 x 40 grid so that all 1000 columns fit
    ax = fig.add_subplot(25, 40, i + 1)
    sns.distplot(df[col], fit=norm, kde=False, ax=ax)
plt.tight_layout()

Additionally, I am not sure if putting columns with the same name in your example was done on purpose. If that's the case the easiest solution to loop through the columns is to use .iloc and the code would look like this:

import seaborn as sns
from scipy.stats import norm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Same repeated 'ABCDABCDED' pattern as in the question: 1000 duplicated column names
df = pd.DataFrame(np.random.randn(50, 1000), columns=list('ABCDABCDED' * 100))

fig = plt.figure(figsize=(12, 10))
for i, col in enumerate(df.columns):
    ax = fig.add_subplot(25, 40, i + 1)
    # Select by position: df[col] would return every column sharing that name
    sns.distplot(df.iloc[:, i], fit=norm, kde=False, ax=ax)
    ax.axis('off')
plt.tight_layout()
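To see why positional indexing matters here, the following small sketch (with a made-up 4-column dataframe) compares label-based and position-based selection when column names repeat:

```python
import numpy as np
import pandas as pd

# Tiny illustrative dataframe with duplicated column names.
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(5, 4)), columns=list("ABAB"))

by_label = df["A"]            # every column named 'A'
by_position = df.iloc[:, 0]   # exactly one column

print(type(by_label).__name__, by_label.shape)        # DataFrame (5, 2)
print(type(by_position).__name__, by_position.shape)  # Series (5,)
```

Selecting by label returns a dataframe with all matching columns, which is not what a per-column plotting loop expects, whereas .iloc always yields a single Series.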


CAPSLOCK answered Oct 18 '22 15:10