Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

normality test of a distribution in python

I have some data I have sampled from a radar satellite image and wanted to perform some statistical tests on. Before this I wanted to conduct a normality test so I could be sure my data was normally distributed. My data appears to be normally distributed but when I perform the test Im getting a Pvalue of 0, suggesting my data is not normally distributed.

I have attached my code along with the output and a histogram of the distribution (Im relatively new to python so apologies if my code is clunky in any way). Can anyone tell me if Im doing something wrong - I find it hard to believe from my histogram that my data is not normally distributed?

values = 'inputfile.h5'
f = h5py.File(values,'r')
dset = f['/DATA/DATA']
array = dset[...,0]
print('normality =', scipy.stats.normaltest(array))
max = np.amax(array)
min = np.amin(array)

histo = np.histogram(array, bins=100, range=(min, max))
freqs = histo[0]
rangebins = (max - min)
numberbins = (len(histo[1])-1)
interval = (rangebins/numberbins)
newbins = np.arange((min), (max), interval)
histogram = bar(newbins, freqs, width=0.2, color='gray')
plt.show()

This prints this: (41099.095955202931, 0.0). the first element is a chi-square value and the second is a pvalue.

I have made a graph of the data which I have attached. I thought that maybe as Im dealing with negative values it was causing a problem so I normalised the values but the problem persists.

histogram of values in array

like image 577
Nathan Thomas Avatar asked Mar 04 '14 17:03

Nathan Thomas


People also ask

How do you test normality of distribution?

For quick and visual identification of a normal distribution, use a QQ plot if you have only one variable to look at and a Box Plot if you have many. Use a histogram if you need to present your results to a non-statistical public. As a statistical test to confirm your hypothesis, use the Shapiro Wilk test.

What is Shapiro test in Python?

Shapiro-Wilk test is a test of normality, it determines whether the given sample comes from the normal distribution or not. Shapiro-Wilk's test or Shapiro test is a normality test in frequentist statistics. The null hypothesis of Shapiro's test is that the population is distributed normally.

What is Normaltest Python?

In Python, scipy. stats. normaltest is used to test this. It gives the statistic which is s^2 + k^2, where s is the z-score returned by skew test and k is the z-score returned by kurtosis test and p-value, i.e., 2-sided chi squared probability for the hypothesis test.


2 Answers

This question explains why you're getting such a small p-value. Essentially, normality tests almost always reject the null on very large sample sizes (in yours, for example, you can see just some skew in the left side, which at your enormous sample size is way more than enough).

What would be much more practically useful in your case is to plot a normal curve fit to your data. Then you can see how the normal curve actually differs (for example, you can see whether the tail on the left side does indeed go too long). For example:

from matplotlib import pyplot as plt
import matplotlib.mlab as mlab

n, bins, patches = plt.hist(array, 50, normed=1)
mu = np.mean(array)
sigma = np.std(array)
plt.plot(bins, mlab.normpdf(bins, mu, sigma))

(Note the normed=1 argument: this ensures that the histogram is normalized to have a total area of 1, which makes it comparable to a density like the normal distribution).

like image 132
David Robinson Avatar answered Sep 17 '22 13:09

David Robinson


In general when the number of samples is less than 50, you should be careful about using tests of normality. Since these tests need enough evidences to reject the null hypothesis, which is "the distribution of the data is normal", and when the number of samples is small they are not able to find those evidences.

Keep in mind that when you fail to reject the null hypothesis it does not mean that the alternative hypothesis is correct.

There is another possibility that: Some implementations of the statistical tests for normality compare the distribution of your data to standard normal distribution. In order to avoid this, I suggest you to standardize the data and then apply the test of normality.

like image 21
2 revs Avatar answered Sep 17 '22 13:09

2 revs