Generate the dataframe for reproducibility:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(50, 1000), columns=list('ABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDED'))
Check the normality of the distribution of each variable (note: this takes a long time to run):
# Set the column names
columns = df.columns
# Loop over all columns
fig, axs = plt.subplots(len(df.columns), figsize=(5, 25))
for n, col in enumerate(df.columns):
    df[col].hist(ax=axs[n])
This generates illegible histograms and takes a very long time to run.
The run time is acceptable, but I am curious whether anyone has suggestions for generating legible histograms (they do not have to be fancy) that can be quickly reviewed across the entire dataframe to check the normality of the distributions.
This code generates 1000 histograms and allows you to see each one in sufficient detail to understand how normally-distributed the columns are:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
cols = 1000
df = pd.DataFrame(np.random.normal(0, 1, [50, cols]))
# Loop over all columns
fig, ax = plt.subplots(figsize = (16, 10))
for n, col in enumerate(df.columns):
    plt.subplot(25, 40, n+1)
    df[col].hist(ax=plt.gca())
    plt.axis('off')
plt.tight_layout()
plt.savefig('1000_histograms.png', bbox_inches='tight', pad_inches = 0, dpi = 200)
Another way to ascertain normality is with a QQ plot, which may be easier to visualize in bulk compared to a histogram:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
cols = 1000
df = pd.DataFrame(np.random.normal(0,1, [50, cols]))
fig, axs = plt.subplots(figsize=(18, 12))
for n, col in enumerate(df.columns):
    plt.subplot(25, 40, n+1)
    sm.qqplot(df[col], ax=plt.gca(),  # line='45',
              marker='.', markerfacecolor='C0', markeredgecolor='C0',
              markersize=2)
    # sm.qqline(ax=plt.gca(), line='45', fmt='lightgray')
    plt.axis('off')
plt.savefig('1000_QQ_plots13.png', bbox_inches='tight', pad_inches=0, dpi=200)
The closer each line is to the 45-degree diagonal, the more normally distributed the column's data are.
As discussed in the comments below, the OP's question has shifted to managing thousands of plots. From that perspective, Nathaniel's answer is appropriate.
However, I felt that the unstated intent was to decide whether a given variable is normally distributed or not, with thousands of variables to consider.
With that in mind, it seems (to me) that having a human review thousands of plots to spot normal/non-normal distributions is an inappropriate method. There is a French idiom for this: "usine à gaz" ("gas factory", i.e. a needlessly convoluted contraption).
Therefore, this answer focuses on performing the analysis programmatically and producing a more concise report.
It performs the normality analysis over a huge number of columns, building on the suggestion expressed in this answer.
The idea is to run a statistical normality test on every column and flag the columns that fail it.
With this method, we can further use programming to handle the normal/non-normal columns. For instance, we could run additional distribution tests, or plot only the non-normal distributions, thus reducing the number of graphs to actually inspect.
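To make that last point concrete, here is a minimal sketch (not from the original answer) that filters down to the columns failing the test before plotting, using a small synthetic frame with one deliberately non-normal (exponential) column:

```python
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripted use
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Two normal columns and one clearly non-normal (exponential) column
df = pd.DataFrame({'A': rng.normal(size=500),
                   'B': rng.normal(size=500),
                   'C': rng.exponential(size=500)})

alpha = 0.01
# Keep only the columns whose normality-test p-value falls below alpha
not_normal = [col for col in df.columns
              if stats.normaltest(df[col]).pvalue < alpha]

# Plot histograms for the flagged columns only
fig, axs = plt.subplots(1, max(len(not_normal), 1), squeeze=False)
for ax, col in zip(axs.ravel(), not_normal):
    df[col].hist(ax=ax)
    ax.set_title(col)
```

This reduces the visual review from every column to just the suspects, which is the point of the programmatic approach.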
Disclaimer: the methods used may not be statistically "canonical". One should be very careful when using statistical tools, since each one has its specific usage domain/use case.
I chose a 0.01 (1%) p-value threshold, since it may become the standard value in scientific publications instead of the usual 0.05 (5%).
For background, see https://en.wikipedia.org/wiki/Normality_test, which lists the common tests of univariate normality (Shapiro–Wilk, Anderson–Darling, D'Agostino's K², among others).
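As a quick illustration (not part of the original answer), the same data could be checked with an alternative test such as Shapiro–Wilk from scipy, alongside the D'Agostino K² test that `scipy.stats.normaltest` implements. The Pareto sampling here mirrors the shape and shift used in the answer's code below:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=10000)
pareto_sample = (rng.pareto(12.0, size=10000) + 15.) * 4.

for name, sample in [('normal', normal_sample), ('pareto', pareto_sample)]:
    _, p_k2 = stats.normaltest(sample)  # D'Agostino's K^2, used in this answer
    _, p_sw = stats.shapiro(sample)     # Shapiro-Wilk, an alternative test
    print('{}: normaltest p={:.3g}, shapiro p={:.3g}'.format(name, p_k2, p_sw))
```

Different tests have different power against different departures from normality, so agreement between two of them adds confidence.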
Behavior may vary on your machine depending on the random number generation. The following example uses five normally distributed columns and five Pareto-distributed columns generated with numpy. The normality test performs well under these conditions (even if the p-values of exactly 0.0 look suspicious to me, even for Pareto samples). Nevertheless, I think we can agree that this is about the method, not the actual results.
import pandas as pd
import numpy as np
import scipy
from scipy import stats
import seaborn as sb
import matplotlib.pyplot as plt
import sys
print('System: {}'.format(sys.version))
for module in [pd, np, scipy, sb]:
    print('Module {:10s} - version {}'.format(module.__name__, module.__version__))
nb_lines = 10000
headers_normal = 'ABCDE'
headers_pareto = 'VWXYZ'
reapeat_factor = 1
nb_cols = len(list(reapeat_factor * headers_normal))
df_normal = pd.DataFrame(np.random.randn(nb_lines, nb_cols), columns=list(reapeat_factor * headers_normal))
df_pareto = pd.DataFrame((np.random.pareto(12.0, size=(nb_lines,nb_cols )) + 15.) * 4., columns=list(reapeat_factor * headers_pareto))
df = df_normal.join(df_pareto)
alpha = 0.01
df_list = list()
# normality code taken from https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html
cat_map = {True: 'Not Normal',
False: 'Maybe Normal'}
for col in df.columns:
    k2, p = stats.normaltest(df[col])
    is_not_normal = p < alpha
    tmp_df = pd.DataFrame({'Column': [col],
                           'Not_Normal': [is_not_normal],
                           'p-value': [p],
                           'Normality': cat_map[is_not_normal]
                           })
    df_list.append(tmp_df)
df_results = pd.concat(df_list)
df_results['Normality'] = df_results['Normality'].astype('category')
print('------------')
print('Column names probably not a normal dist:')
# full data
print(df_results[(df_results['Normality'] == 'Not Normal')])
# only column names
# print(df_results[(df_results['Normality'] == 'Not Normal')]['Column'])
print('------------')
print('Plotting countplot')
sb.countplot(data=df_results, y='Normality', orient='v')
plt.show()
Outputs:
System: 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
Module pandas - version 0.24.1
Module numpy - version 1.16.2
Module scipy - version 1.2.1
Module seaborn - version 0.9.0
------------
Column names probably not a normal dist:
Column Not_Normal p-value Normality
0 V True 0.0 Not Normal
0 W True 0.0 Not Normal
0 X True 0.0 Not Normal
0 Y True 0.0 Not Normal
0 Z True 0.0 Not Normal
------------
Plotting countplot
I really like Nathaniel's answer but I will add my two cents.
I would go for seaborn and in particular seaborn.distplot. This will allow you to easily fit a normal distribution to each histogram plot and make the visualization easier.
import seaborn as sns
from scipy.stats import norm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
cols = 1000
df = pd.DataFrame(np.random.normal(0, 1, [50, cols]))
fig = plt.figure(figsize=(16, 10))
for i, col in enumerate(df.columns):
    # 25 x 40 grid to fit all 1000 columns (25 x 4 would raise an error)
    ax = fig.add_subplot(25, 40, i+1)
    sns.distplot(df[col], fit=norm, kde=False, ax=ax)
    plt.axis('off')
plt.tight_layout()
Additionally, I am not sure whether giving several columns the same name in your example was done on purpose. If so, the easiest way to loop through the columns is to use .iloc, and the code would look like this:
import seaborn as sns
from scipy.stats import norm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(50, 1000), columns=list('ABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDED'))
fig, ax = plt.subplots(figsize = (12, 10))
for i, col in enumerate(df.columns):
    plt.subplot(25, 40, i+1)
    sns.distplot(df.iloc[:, i], fit=norm, kde=False, ax=plt.gca())
    plt.axis('off')
plt.tight_layout()