
Kurtosis, Skewness of a bar graph? - Python

What is an efficient method for determining the skew/kurtosis of a bar graph in Python? Since bar graphs are not binned (unlike histograms), the question may not make much sense as posed, but what I am trying to do is determine the symmetry of a graph's height vs. distance (rather than frequency vs. bins). In other words, given heights (y) measured along distance (x), i.e.

y = [6.18, 10.23, 33.15, 55.25, 84.19, 91.09, 106.6, 105.63, 114.26, 134.24, 137.44, 144.61, 143.14, 150.73, 156.44, 155.71, 145.88, 120.77, 99.81, 85.81, 55.81, 49.81, 37.81, 25.81, 5.81]
x = [0.03, 0.08, 0.14, 0.2, 0.25, 0.31, 0.36, 0.42, 0.48, 0.53, 0.59, 0.64, 0.7, 0.76, 0.81, 0.87, 0.92, 0.98, 1.04, 1.09, 1.15, 1.2, 1.26, 1.32, 1.37]

What is the symmetry of that height (y) distribution (skewness) and its peakedness (kurtosis) as measured over distance (x)? Are skewness/kurtosis appropriate measurements for assessing how close real-valued data is to a normal distribution? Or does scipy/numpy offer something similar for that type of measurement?

I can get a skew/kurtosis estimate by treating the height (y) values as frequencies binned along distance (x), as follows:

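from itertools import chain
from matplotlib.pyplot import hist, xlabel, ylabel
from scipy import stats

# Repeat each distance x_v round(y_v) times to build a pseudo-frequency sample
freq = list(chain(*[[x_v] * int(round(y_v)) for x_v, y_v in zip(x, y)]))
bins = x + [x[-1] + x[0]]           # add one extra bin edge without mutating x
hist(freq, bins=bins)
ylabel("Height Frequency")
xlabel("Distance(km) Bins")
print("Skewness,", "Kurtosis:", stats.describe(freq)[4:])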

Skewness, Kurtosis: (-0.019354300509997705, -0.7447085398785758)

[Image: histogram of height frequency against distance (km) bins]

In this case the height distribution is nearly symmetrical (skew -0.02) around the midpoint distance and is platykurtic (kurtosis -0.74, i.e. broad).

Because I repeat each x value as many times as its height y to create a frequency, the resulting list can sometimes get very large. Is there a better way to approach this problem? I suppose that I could always normalize dataset y to a range of perhaps 0 - 100 without losing too much information about the dataset's skew/kurtosis.

asked Jul 11 '13 08:07 by GeoPy


1 Answer

This isn't a Python question, nor is it really a programming question, but the answer is simple nonetheless. Instead of skew and kurtosis, let's first consider the easier quantities based on the lower moments: the mean and standard deviation. To make it concrete, and to fit with your question, let's assume your data looks like:

X = 3, 3, 5, 5, 5, 7 = x1, x2, x3 ....

Which would give a "bar graph" (value: count) that looks like:

{3:2, 5:3, 7:1} = {k1:p1, k2:p2, k3:p3}

The mean, u, is given by

E[X] = (1/N) * (x1 + x2 + x3 + ...) = (1/N) * (3 + 3 + 5 + ...)

Our data, however, has repeated values, so with N = p1 + p2 + ... this can be rewritten as

E[X] = (1/N) * (p1*k1 + p2*k2 + ...) = (1/N) * (3*2 + 5*3 + 7*1)
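As a quick check of this rewriting, here is a minimal sketch in plain Python. The names `k`, `p`, and `expanded` are illustrative and not part of the original answer; it only confirms that the count-weighted sum gives the same mean as the fully expanded data.

```python
# Illustrative names: k holds the distinct values, p their counts.
k = [3, 5, 7]
p = [2, 3, 1]

# Expanded data: each value repeated p times -> [3, 3, 5, 5, 5, 7]
expanded = [value for value, count in zip(k, p) for _ in range(count)]
mean_expanded = sum(expanded) / len(expanded)

# Bar-graph form: (1/N) * sum(p_i * k_i), where N = sum(p_i)
n = sum(p)
mean_weighted = sum(count * value for value, count in zip(k, p)) / n

print(mean_expanded, mean_weighted)
```

Nothing in the weighted form requires the counts to be integers, which is what makes it usable with measured heights.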

The next term, the standard deviation, s, is simply

sqrt(E[(X-u)^2]) = sqrt((1/N)*( (x1-u)^2 + (x2-u)^2 + ...))

But we can apply the same reduction to the E[(X-u)^2] term and write it as

E[(X-u)^2] = (1/N)*( p1*(k1-u)^2 + p2*(k2-u)^2 + ... )
           = (1/6)*( 2*(3-u)^2 + 3*(5-u)^2 + 1*(7-u)^2 )
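The same reduction works for any central moment. The sketch below assumes a hypothetical helper, `weighted_central_moment`, that is not from the original answer, and evaluates the variance and standard deviation of the example data without expanding it.

```python
import math


def weighted_central_moment(k, p, order):
    """Return E[(X-u)^order] for values k with non-negative weights p."""
    n = sum(p)
    u = sum(pi * ki for ki, pi in zip(k, p)) / n  # weighted mean
    return sum(pi * (ki - u) ** order for ki, pi in zip(k, p)) / n


var = weighted_central_moment([3, 5, 7], [2, 3, 1], 2)  # E[(X-u)^2]
s = math.sqrt(var)                                      # standard deviation
print(var, s)
```

This is the population form (divide by N), matching the formulas in this answer rather than the N-1 sample estimate.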

This means we don't need multiple copies of each data item to do the sum, as you did in your question.

The skew and kurtosis are quite simple at this point:

skew     = E[(X-u)^3] / (E[(X-u)^2])^(3/2)
kurtosis = ( E[(X-u)^4] / (E[(X-u)^2])^2 ) - 3
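
Putting the pieces together, a sketch of these two formulas applied to the data from the question might look like the following. The function name `weighted_skew_kurtosis` is an assumption made for illustration; the heights y are used directly as weights for the distances x, with no rounding and no expanded list.

```python
def weighted_skew_kurtosis(values, weights):
    """Return (skew, excess kurtosis) of `values` weighted by `weights`.

    Uses the population moment formulas given above; weights are assumed
    to be non-negative with a positive sum.
    """
    n = sum(weights)
    u = sum(w * v for v, w in zip(values, weights)) / n  # weighted mean

    def central_moment(order):
        return sum(w * (v - u) ** order for v, w in zip(values, weights)) / n

    m2 = central_moment(2)
    skew = central_moment(3) / m2 ** 1.5
    kurtosis = central_moment(4) / m2 ** 2 - 3
    return skew, kurtosis


# Data from the question: heights y act as weights for the distances x.
y = [6.18, 10.23, 33.15, 55.25, 84.19, 91.09, 106.6, 105.63, 114.26, 134.24,
     137.44, 144.61, 143.14, 150.73, 156.44, 155.71, 145.88, 120.77, 99.81,
     85.81, 55.81, 49.81, 37.81, 25.81, 5.81]
x = [0.03, 0.08, 0.14, 0.2, 0.25, 0.31, 0.36, 0.42, 0.48, 0.53, 0.59, 0.64,
     0.7, 0.76, 0.81, 0.87, 0.92, 0.98, 1.04, 1.09, 1.15, 1.2, 1.26, 1.32,
     1.37]

print("Skewness, Kurtosis:", weighted_skew_kurtosis(x, y))
```

Because the weights are not rounded to integers here, the result may differ slightly from the values quoted in the question. The cost is one pass per moment over the 25 bars, regardless of how large the heights are.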

answered Sep 23 '22 18:09 by Hooked