
How can I check the distribution of a variable in Python? [closed]

In a unit test I need to check that the values of an array are uniformly distributed. For example:

in the array [1, 0, 1, 0, 1, 1, 0, 0] the values are uniformly distributed, since there are four "1" and four "0".

For longer arrays, the distribution becomes more "uniform".

How do I show that the array under test has a uniform distribution?

Note: the array is created with randint(min, max, len) from numpy.random.

eduardo.sufan asked Mar 13 '14 22:03



1 Answer

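You can use the Kolmogorov-Smirnov test, which is provided by scipy.stats.kstest: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest. Strictly speaking, the test is derived for continuous distributions; applied to discrete values such as these integers it still runs, but the p-value is only approximate and tends to be conservative.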

In [12]:

import scipy.stats as ss
import numpy as np
In [14]:

A=np.random.randint(0,10,100)
In [16]:

ss.kstest(A, ss.randint.cdf, args=(0,10))
#args is a tuple holding the extra parameters required by ss.randint.cdf, in this case the lower and upper bound
Out[16]:
(0.12, 0.10331653831438881)
#This is a tuple of two values: the KS test statistic (D, D+ or D-) and the p-value

Here the resulting p-value is 0.1033, so we conclude that the array A is not significantly different from a uniform distribution. The p-value is the probability of getting a test statistic at least as extreme as the one observed (here, the first number in the tuple), assuming the null hypothesis is true. In the KS test, the null hypothesis is that A is drawn from the reference distribution, here a uniform one. A p-value of 0.1033 is usually not considered extreme enough to reject the null hypothesis; typically the p-value has to be below 0.05 or 0.01 for that. Had the p-value in this example been below 0.05, we would say that A is significantly different from a uniform distribution.
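The decision rule described above can be wrapped in a small helper. This is only a sketch: the function name ks_rejects_uniform and the default alpha of 0.05 are illustrative choices, not part of scipy, and the two sample arrays are constructed by hand so that the outcome does not depend on a random draw.

```python
import numpy as np
import scipy.stats as ss


def ks_rejects_uniform(sample, low, high, alpha=0.05):
    """Return True when the KS test rejects 'sample is uniform on low..high-1'.

    Hypothetical helper for illustration; alpha is the significance level.
    """
    # ss.randint.cdf is the CDF of the discrete uniform distribution on [low, high)
    statistic, p_value = ss.kstest(sample, ss.randint.cdf, args=(low, high))
    return p_value < alpha


balanced = np.tile(np.arange(10), 10)  # every value 0..9 appears exactly 10 times
skewed = np.zeros(100, dtype=int)      # only the value 0 ever appears

print(ks_rejects_uniform(balanced, 0, 10))
print(ks_rejects_uniform(skewed, 0, 10))
```

The balanced array should not be rejected, while the all-zeros array should be, since its empirical CDF jumps straight to 1 at the value 0.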

An alternative is to use scipy.stats.chisquare():

In [17]:

import scipy.stats as ss
import numpy as np
In [18]:

A=np.random.randint(0, 10, 100)
In [19]:

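FRQ=(A==np.arange(10)[...,np.newaxis]).sum(axis=1) #observed count of each value 0..9; chisquare expects counts, not proportions
In [20]:

ss.chisquare(FRQ) #If not specified, the default expected frequency is uniform across categories.

The first value returned is the chi-square statistic and the second is the p-value. For the sample above the statistic comes to 8.4 on 9 degrees of freedom, which corresponds to a p-value of roughly 0.5, so uniformity is not rejected. Passing relative frequencies (the counts divided by A.size) instead shrinks the statistic by a factor of A.size, here to 0.084, and yields a misleadingly high p-value of 0.99999998822800984.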
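Since the question is about a unit test, the chi-square check can be packaged as a predicate. The following is a minimal sketch under some assumptions: looks_uniform is a hypothetical name, the values are assumed to be integers in [low, high), and alpha=0.01 is an arbitrary threshold. np.bincount is used here as a shorter way to get the observed counts.

```python
import numpy as np
from scipy import stats


def looks_uniform(values, low, high, alpha=0.01):
    """Return True if a chi-square test does not reject uniformity.

    Hypothetical helper; assumes integer values in the half-open range [low, high).
    """
    values = np.asarray(values)
    # Observed count of each possible value; minlength keeps zero-count categories
    counts = np.bincount(values - low, minlength=high - low)
    # With no f_exp given, chisquare compares against equal expected counts
    statistic, p_value = stats.chisquare(counts)
    return p_value >= alpha


balanced = np.tile(np.arange(10), 10)  # counts are exactly 10 per value
skewed = np.zeros(100, dtype=int)      # all 100 observations fall on 0

print(looks_uniform(balanced, 0, 10))
print(looks_uniform(skewed, 0, 10))
```

In a real test the array would come from np.random.randint. Because a truly uniform generator still fails such a test with probability alpha, it may help to seed the generator or pick a small alpha so the test is not flaky.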

CT Zhu answered Oct 06 '22 00:10