Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Weird pdfs from Generalised Extreme Value (GEV) Maximum Likelihood fitted data

I am doing some data analysis involving fitting datasets to a Generalised Extreme Value (GEV) distribution, but I'm getting some weird results. Here's what I'm doing:

from scipy.stats import genextreme as gev
import numpy
data = [1.47, 0.02, 0.3, 0.01, 0.01, 0.02, 0.02, 0.12, 0.38, 0.02, 0.15, 0.01, 0.3, 0.24, 0.01, 0.05, 0.01, 0.0, 0.06, 0.01, 0.01, 0.0, 0.05, 0.0, 0.09, 0.03, 0.22, 0.0, 0.1, 0.0]
x = numpy.linspace(0, 2, 20)
pdf = gev.pdf(x, *gev.fit(data))
print(pdf)

And the output:

array([  5.64759709e+05,   2.41090345e+00,   1.16591714e+00,
         7.60085002e-01,   5.60415578e-01,   4.42145248e-01,
         3.64144425e-01,   3.08947114e-01,   2.67889183e-01,
         2.36190826e-01,   2.11002185e-01,   1.90520108e-01,
         1.73548832e-01,   1.59264573e-01,   1.47081601e-01,
         1.36572220e-01,   1.27416958e-01,   1.19372442e-01,
         1.12250072e-01,   1.05901466e-01,   1.00208313e-01,
         9.50751375e-02,   9.04240603e-02,   8.61909342e-02,
         8.23224528e-02,   7.87739599e-02,   7.55077677e-02,
         7.24918532e-02,   6.96988348e-02,   6.71051638e-02,
         6.46904782e-02,   6.24370827e-02,   6.03295277e-02,
         5.83542648e-02,   5.64993643e-02,   5.47542808e-02,
         5.31096590e-02,   5.15571710e-02,   5.00893793e-02,
         4.86996213e-02,   4.73819114e-02,   4.61308575e-02,
         4.49415891e-02,   4.38096962e-02,   4.27311763e-02,
         4.17023886e-02,   4.07200140e-02,   3.97810205e-02,
         3.88826331e-02,   3.80223072e-02])

The problem is that the first value is huge, totally distorting all the results, its show quite clearly in a plot: pdf

I've experimented with other data, and random samples, and in some cases it works. The first value in my dataset is significantly higher than the rest, but it is a valid value so I can't just drop it.

Does anyone have any idea why this is happening?

Update

Here is another example showing the problem much more clearly:

In [1]: from scipy.stats import genextreme as gev, kstest

In [2]: data = [0.01, 0.0, 0.28, 0.0, 0.0, 0.0, 0.01, 0.0, 0.0, 0.13, 0.07, 0.03
, 0.01, 0.42, 0.11, 0.0, 0.0, 0.0, 0.0, 0.25, 0.0, 0.0, 0.26, 1.32, 0.06, 0.02,
1.57, 0.07, 1.56, 0.04]

In [3]: fit = gev.fit(data)

In [4]: kstest(data, 'genextreme', fit)
Out[4]: (0.48015007915450658, 6.966510064376763e-07)

In [5]: x = linspace(0, 2, 200)

In [6]: plot(x, gev.pdf(x, *fit))
Out[6]: [<matplotlib.lines.Line2D at 0x97590f0>]

In [7]: hist(data)

Note specifically, line 4 shows a p-value of about 7e-7, way below what's normally considered acceptable. Here is the plot produced:

2nd Figure

like image 614
aquavitae Avatar asked Mar 04 '14 09:03

aquavitae


1 Answers

First, I think you may want to keep you location parameter fixed at 0.

Second, you got zeros in your data, the resulting fit may have +inf pdf at x=0 e.g. for the GEV fit or for the Weibull fit. Therefore, the fit is actually correct, but when you plot the pdf (including x=0), the resulting plot is distorted.

Third, I really think scipy should drop the support for x=0 for a bunch of distributions such as Weibull. For x=0, R gives a nice warning of Error in fitdistr(data, "weibull") : Weibull values must be > 0, that is helpful.

In [103]:

p=ss.genextreme.fit(data, floc=0)
ss.genextreme.fit(data, floc=0)
Out[103]:
(-1.372872096699608, 0, 0.011680600795499299)
In [104]:

plt.hist(data, bins=20, normed=True, alpha=0.7, label='Data')
plt.plot(np.linspace(5e-3, 1.6, 100),
         ss.genextreme.pdf(np.linspace(5e-3, 1.6, 100), p[0], p[1], p[2]), 'r--',
         label='GEV Fit')
plt.legend(loc='upper right')
plt.savefig('T1.png')

enter image description here

In [105]:

p=ss.expon.fit(data, floc=0)
ss.expon.fit(data, floc=0)
Out[105]:
(0, 0.14838807003769505)
In [106]:

plt.hist(data, bins=20, normed=True, alpha=0.7, label='Data')
plt.plot(np.linspace(0, 1.6, 100),
         ss.expon.pdf(np.linspace(0, 1.6, 100), p[0], p[1]), 'r--',
         label='Expon. Fit')
plt.legend(loc='upper right')
plt.savefig('T2.png')

enter image description here

In [107]:

p=ss.weibull_min.fit(data[data!=0], floc=0)
ss.weibull_min.fit(data[data!=0], floc=0)
Out[107]:
(0.67366030738733995, 0, 0.10535422201164378)
In [108]:

plt.hist(data[data!=0], bins=20, normed=True, alpha=0.7, label='Data')
plt.plot(np.linspace(5e-3, 1.6, 100),
         ss.weibull_min.pdf(np.linspace(5e-3, 1.6, 100), p[0], p[1], p[2]), 'r--',
         label='Weibull_Min Fit')
plt.legend(loc='upper right')
plt.savefig('T3.png')

enter image description here


edit

Your second data (which contains even more 0's )is a good example when MLE fit involving location parameter can become very challenging, especially potentially with a lot of float point overflow/underflow involved:

In [122]:
#fit with location parameter fixed, scanning loc parameter from 1e-8 to 1e1
L=[] #stores the Log-likelihood
P=[] #stores the p value of K-S test
for LC in np.linspace(-8, 1, 200):
    fit = gev.fit(data, floc=10**LC)
    L.append(np.log(gev.pdf(data, *fit)).sum())
    P.append(kstest(data, 'genextreme', fit)[1])
L=np.array(L)
P=np.array(P)
In [123]:
#plot log likelihood, a lot of overflow/underflow issues! (see the zigzag line?)
plt.plot(np.linspace(-8, 1, 200), L,'-')
plt.ylabel('Log-Likelihood')
plt.xlabel('$log_{10}($'+'location parameter'+'$)$')

enter image description here

In [124]:
#plot p-value
plt.plot(np.linspace(-8, 1, 200), np.log10(P),'-')
plt.ylabel('$log_{10}($'+'K-S test P value'+'$)$')
plt.xlabel('$log_{10}($'+'location parameter'+'$)$')
Out[124]:
<matplotlib.text.Text at 0x107e3050>

enter image description here

In [128]:
#The best fit between with location paramter between 1e-8 to 1e1 has the loglikelihood of 515.18
np.linspace(-8, 1, 200)[L.argmax()]
fit = gev.fit(data, floc=10**(np.linspace(-8, 1, 200)[L.argmax()]))
np.log(gev.pdf(data, *fit)).sum()
Out[128]:
515.17663678368604
In [129]:
#The simple MLE fit is clearly bad, loglikelihood is -inf
fit0 = gev.fit(data)
np.log(gev.pdf(data, *fit0)).sum()
Out[129]:
-inf

In [133]:
#plot the fit
x = np.linspace(0.005, 2, 200)
plt.plot(x, gev.pdf(x, *fit))
plt.hist(data,normed=True, alpha=0.6, bins=20)
Out[133]:
(array([ 8.91719745,  0.8492569 ,  0.        ,  1.27388535,  0.        ,
        0.42462845,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.42462845,  0.        ,  0.        ,  0.8492569 ]),
 array([ 0.    ,  0.0785,  0.157 ,  0.2355,  0.314 ,  0.3925,  0.471 ,
        0.5495,  0.628 ,  0.7065,  0.785 ,  0.8635,  0.942 ,  1.0205,
        1.099 ,  1.1775,  1.256 ,  1.3345,  1.413 ,  1.4915,  1.57  ]),
 <a list of 20 Patch objects>)

enter image description here


Edit, goodness of fit test for GEV

A side note on KS test. You are testing the goodness-of-fit to a GEV with its parameter ESTIMATED FROM THE DATA itself. In such a case, the p value is invalid, see: itl.nist.gov/div898/handbook/eda/section3/eda35g.htm

There seems to be a lot of studies on the topic of goodness-of-fit test for GEV, I haven't found any available implementations for those yet.

http://onlinelibrary.wiley.com/doi/10.1029/98WR02364/abstract http://onlinelibrary.wiley.com/doi/10.1029/91WR00077/abstract http://www.idrologia.polito.it/~laio/articoli/16-WRR%20EDFtest.pdf

like image 72
CT Zhu Avatar answered Sep 28 '22 09:09

CT Zhu