generate random lognormal distributions using shape of observed data

Tags:

python

scipy

I'm trying to fit some data to a lognormal distribution and, from the fitted parameters, generate new random lognormal samples. After some searching I found a couple of solutions, but none is convincing:

Solution 1, using the fit function:

import numpy as np
from scipy.stats import lognorm

mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354]

shape, loc, scale = lognorm.fit(mydata)
rnd_log = lognorm.rvs(shape, loc=loc, scale=scale, size=100)

or Solution 2, using mu and sigma from the original data:

import numpy as np
from scipy.stats import lognorm

mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354]

mu    = np.mean([np.log(i) for i in mydata])
sigma = np.std([np.log(i) for i in mydata])

distr   = lognorm(mu, sigma)
rnd_log = distr.rvs(size=100)

Neither of those solutions fits well:

import pylab
pylab.plot(sorted(mydata, reverse=True), 'ro')
pylab.plot(sorted(rnd_log, reverse=True), 'bx')

I am not sure if I understand how to use these distributions correctly, or if I am missing something else...

I thought I would find the solution here: Does anyone have example code of using scipy.stats.distributions? but I am not able to get the shape from my data... am I missing something in the use of the fit function?

Thanks.

EDIT:

here is an example to illustrate my problem better:

print 'solution 1:'
means = []
stdes = []
shape, loc, scale = lognorm.fit(mydata)
for _ in xrange(1000):
    rnd_log = lognorm.rvs(shape, loc=loc, scale=scale, size=100)
    means.append(np.mean([np.log(i) for i in rnd_log]))
    stdes.append(np.std([np.log(i) for i in rnd_log]))
print 'observed mean:', mu, 'mean simulated mean:', np.mean(means)
print 'observed std :', sigma, 'mean simulated std :', np.mean(stdes)

print '\nsolution 2:'
means = []
stdes = []
distr = lognorm(mu, sigma)
for _ in xrange(1000):
    rnd_log = distr.rvs(size=100)
    means.append(np.mean([np.log(i) for i in rnd_log]))
    stdes.append(np.std([np.log(i) for i in rnd_log]))
print 'observed mean:', mu, 'mean simulated mean:', np.mean(means)
print 'observed std :', sigma, 'mean simulated std :', np.mean(stdes)

the result is:

solution 1:
observed mean: 1.82562655734 mean simulated mean: 4.50608084668
observed std : 1.39003773799 mean simulated std : 5.44206119499

solution 2:
observed mean: 1.82562655734 mean simulated mean: 1.18929982267
observed std : 1.39003773799 mean simulated std : 0.88985924363

while, if I do the same in R:

mydata <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354)
meanlog <- mean(log(mydata))
sdlog <- sd(log(mydata))
means <- c()
stdes <- c()
for (i in 1:1000){
  rnd.log <- rlnorm(length(mydata), meanlog, sdlog)
  means <- c(means, mean(log(rnd.log)))
  stdes <- c(stdes, sd(log(rnd.log)))
}

print (paste('observed mean:',meanlog,'mean simulated mean:',mean(means),sep=' '))
print (paste('observed std :',sdlog  ,'mean simulated std :',mean(stdes),sep=' '))

i get:

[1] "observed mean: 1.82562655733507 mean simulated mean: 1.82307191072317"
[1] "observed std : 1.39704049131865 mean simulated std : 1.39736545866904"

which is much closer, so I guess I am doing something wrong when using scipy...

asked Oct 10 '22 by fransua

1 Answer

The lognormal distribution in scipy is parametrized a little differently from the usual way. See the scipy.stats.lognorm docs, particularly the "Notes" section.
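Concretely, scipy's shape parameter s plays the role of sigma, and scale plays the role of exp(mu), with loc shifting the whole distribution (usually you want loc=0). A small sketch comparing scipy's pdf against the textbook lognormal density, with illustrative values for mu and sigma:

```python
import numpy as np
from scipy.stats import lognorm

mu, sigma = 1.8, 1.4  # hypothetical log-space parameters

x = np.linspace(0.1, 50, 5)

# scipy: lognorm(s=sigma, loc=0, scale=exp(mu))
pdf_scipy = lognorm.pdf(x, sigma, loc=0, scale=np.exp(mu))

# textbook density of lognormal(mu, sigma)
pdf_manual = (np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2))
              / (x * sigma * np.sqrt(2 * np.pi)))

print(np.allclose(pdf_scipy, pdf_manual))  # True
```

So `lognorm(mu, sigma)` from the question silently treats mu as the shape and sigma as the location, which is why the samples look nothing like the data.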

Here's how to get the results you're expecting (note that we hold location to 0 when fitting):

In [315]: from scipy import stats

In [316]: x = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354])

In [317]: mu, sigma = stats.norm.fit(np.log(x))

In [318]: mu, sigma
Out[318]: (1.8256265573350701, 1.3900377379913127)

In [319]: shape, loc, scale = stats.lognorm.fit(x, floc=0)

In [320]: np.log(scale), shape
Out[320]: (1.8256267737298788, 1.3900309739954713)

Now you can generate samples and confirm your expectations:

In [321]: dist = stats.lognorm(shape, loc, scale)

In [322]: means, sds = [], []

In [323]: for i in xrange(1000):
   .....:     sample = dist.rvs(size=100)
   .....:     logsample = np.log(sample)
   .....:     means.append(logsample.mean())
   .....:     sds.append(logsample.std())
   .....:

In [324]: np.mean(means), np.mean(sds)
Out[324]: (1.8231068508345041, 1.3816361818739145)
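For what it's worth, once loc is pinned at 0, a lognormal sample is just exp of a normal sample, so numpy can draw the same thing directly with its (mean, sigma) parametrization; the values below are the log-space estimates from the session above:

```python
import numpy as np

mu, sigma = 1.826, 1.390  # log-space estimates from the fit above

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=mu, sigma=sigma, size=100000)

logsample = np.log(sample)
print(logsample.mean(), logsample.std())  # close to 1.826 and 1.390
```

This mirrors R's rlnorm(n, meanlog, sdlog), which is why the R version in the question recovered the observed parameters without any conversion.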
answered Oct 12 '22 by ars