Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python scipy stats pareto fit: how does it work

... help and online documentation say the function scipy.stats.pareto.fit takes as variables the dataset to be fitted, and optionally b (the exponent), loc, scale. the result comes as triplet (exponent, loc, scale)

generating data from the same distribution should result in the fit finding the parameters used for generating the data, e.g. (using the python 3 colsole)

$  python
Python 3.3.0 (default, Dec 12 2012, 07:43:02) 
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

(in code lines below leaving out the python console prompt ">>>")

dataset=scipy.stats.pareto.rvs(1.5,size=10000)  #generating data
scipy.stats.pareto.fit(dataset)

however this results in

(1.0, nan, 0.0)

(exponent 1, should be 1.5) and

dataset=scipy.stats.pareto.rvs(1.1,size=10000)  #generating data
scipy.stats.pareto.fit(dataset)

results in

(1.0, nan, 0.0)

(exponent 1, should be 1.1) and

dataset=scipy.stats.pareto.rvs(4,loc=2.0,scale=0.4,size=10000)    #generating data
scipy.stats.pareto.fit(dataset)

(exponent should be 4, loc should be 2, scale should be 0.4) in

(1.0, nan, 0.0)

etc. giving another exponent when calling the fit function

scipy.stats.pareto.fit(dataset,1.4)

returns always exactly this exponent

(1.3999999999999999, nan, 0.0)

The obvious question would be: do I misunderstand the purpose of this fit function completely, is it used somehow differently, or is it simply broken?

a remark: before someone mentions that dedicated functions like those given on Aaron Clauset's web pages (http://tuvalu.santafe.edu/~aaronc/powerlaws/) are more reliable than the scipy.stats methods and should be used instead: that may be true, but they are also very very very very time consuming and do for datasets of 10000 points take many many hours (maybe days, weeks, years) on a normal PC.

edit: oh: the parameter of the fit function is not the exponent of the distribution but exponent minus 1 (but this does not change the above issue)

like image 319
0range Avatar asked Dec 26 '22 07:12

0range


1 Answers

It looks like you must supply a guess for the loc and scale:

In [78]: import scipy.stats as stats

In [79]: b, loc, scale = 1.5, 0, 1

In [80]: data = stats.pareto.rvs(b, size=10000)

In [81]: stats.pareto.fit(data, 1, loc=0, scale=1)
Out[81]: (1.5237427002368424, -2.8457847787917788e-05, 1.0000329980475393)

and the guess has to be pretty accurate for the fit to succeed:

In [82]: stats.pareto.fit(data, 1, loc=0, scale=1.01)
Out[82]: (1.5254113096223709, -0.0015898489208676779, 1.0015943893384001)

In [83]: stats.pareto.fit(data, 1, loc=0, scale=1.05)
Out[83]: (1.5234726749064218, 0.00025804526532994751, 0.99974649559141171)

In [84]: stats.pareto.fit(data, 1, loc=0.05, scale=1.05)
Out[84]: (1.0, 0.050000000000000003, 1.05)

Hopefully the context of the problem will inform you what an appropriate guess for the loc and scale should be. Most likely, loc=0 and scale=1.

like image 141
unutbu Avatar answered Dec 29 '22 03:12

unutbu