
Log Normal Random Variables with Scipy

Tags: python, scipy

I fail to understand the very basics of creating lognormal variables as documented here.

The lognormal distribution takes a mean and a variance as parameters. I would like to create a frozen distribution from these parameters and then evaluate its cdf, pdf, etc.

However, in the documentation, they get the frozen distribution using

from scipy.stats import lognorm
s = 0.953682269606
rv = lognorm(s)

's' seems to be the standard deviation. I tried to use the 'loc' and 'scale' parameters instead of 's', but that generated an error (s is a required parameter). How can I generate a frozen distribution with parameter values 'm', 's' for location and scale?

asked Feb 24 '15 16:02 by FooBar



1 Answer

The mystery is solved (edit 3)

  • μ corresponds to ln(scale) (!)
  • σ corresponds to shape (s)
  • loc is not needed for setting any of σ and μ
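As a minimal sketch of the mapping above, a frozen distribution for a given μ and σ could be built as follows. The helper name `frozen_lognorm` and the example values are made up for illustration and are not part of SciPy.

```python
import math
from scipy.stats import lognorm

def frozen_lognorm(mu, sigma):
    """Hypothetical helper: lognormal whose logarithm has mean mu and stddev sigma."""
    # shape s = sigma, scale = exp(mu), loc is left at 0
    return lognorm(s=sigma, loc=0, scale=math.exp(mu))

rv = frozen_lognorm(mu=2.0, sigma=0.5)   # illustrative values

median = rv.median()                     # the median of a lognormal is exp(mu)
cdf_at_median = rv.cdf(math.exp(2.0))    # so the cdf there should be 0.5
pdf_at_one = rv.pdf(1.0)                 # density at an arbitrary point

print(median, cdf_at_median, pdf_at_one)
```

The frozen object then exposes cdf, pdf, ppf, rvs and so on without the parameters having to be passed again.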

I think it is a severe problem that this is not clearly documented. I guess many have fallen for this when doing simple tests with the lognormal distribution in SciPy.

Why is that?

The stats module treats loc and scale the same way for all distributions (this is not stated explicitly, but can be inferred by reading between the lines). My suspicion was that loc is subtracted from x and the result is divided by scale, with the quotient then treated as the new x. I tested this, and it turned out to be the case.

What does this mean for the lognormal distribution? In the canonical definition of the lognormal distribution the term ln(x) appears. The same term appears in SciPy's implementation. Given the considerations above, this is how loc and scale end up in the logarithm:

ln((x-loc)/scale)

By the usual logarithm rules, this is the same as

ln(x-loc) - ln(scale)

In the canonical definition of the lognormal distribution the term is simply ln(x) - μ. Comparing SciPy's approach with the canonical one (with loc = 0) provides the crucial insight: ln(scale) represents μ. loc, however, has no counterpart in the canonical definition and is best left at 0. Further below, I argue that shape (s) is σ.
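The shift-and-rescale behaviour can be checked numerically. The sketch below uses arbitrary illustrative values for s, loc and scale, and compares SciPy's pdf against the standardized form and against a normal density in ln(x - loc) with mean ln(scale).

```python
import numpy as np
from scipy.stats import lognorm, norm

s, loc, scale = 0.75, 1.5, 3.0     # arbitrary illustrative values
x = np.linspace(2.0, 20.0, 50)     # every point lies above loc

# pdf with loc and scale handed to SciPy directly
direct = lognorm.pdf(x, s, loc=loc, scale=scale)

# same density via the standardized variable y = (x - loc) / scale
y = (x - loc) / scale
standardized = lognorm.pdf(y, s) / scale

# canonical form: normal density in ln(x - loc) with mu = ln(scale), sigma = s
canonical = norm.pdf(np.log(x - loc), loc=np.log(scale), scale=s) / (x - loc)

print(np.allclose(direct, standardized), np.allclose(direct, canonical))
```

If both comparisons agree, ln(scale) plays the role of μ and s the role of σ, while loc only shifts x.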

Proof

>>> import math
>>> from scipy.stats import lognorm
>>> mu = 2
>>> sigma = 2
>>> l = lognorm(s=sigma, loc=0, scale=math.exp(mu))
>>> print("mean: %.5f  stddev: %.5f" % (l.mean(), l.std()))
mean: 54.59815  stddev: 399.71719

I use WolframAlpha as a reference. It provides analytically determined values for the mean and standard deviation of the lognormal distribution.

http://www.wolframalpha.com/input/?i=log-normal+distribution%2C+mean%3D2%2C+sd%3D2

The values match.

Both WolframAlpha and SciPy obtain the mean and standard deviation by evaluating analytical expressions. Let's perform an empirical test by drawing many samples from the SciPy distribution and calculating their mean and standard deviation "manually" (from the whole set of samples):

>>> import numpy as np
>>> samples = l.rvs(size=2*10**7)
>>> print("mean: %.5f  stddev: %.5f" % (np.mean(samples), np.std(samples)))
mean: 54.52148  stddev: 380.14457

The sample standard deviation has not fully converged (with σ=2 the distribution is heavy-tailed, so sample moments converge slowly), but I think it is proof enough that the samples come from the same distribution that WolframAlpha assumed, given μ=2 and σ=2.
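A check that converges much faster is to take the logarithm of the samples, since ln(X) should be normal with mean μ and standard deviation σ. This is a sketch only: the sample size and the fixed `random_state` seed are arbitrary choices made for reproducibility.

```python
import math
import numpy as np
from scipy.stats import lognorm

mu, sigma = 2.0, 2.0
rv = lognorm(s=sigma, loc=0, scale=math.exp(mu))

# the seed is fixed only so that the sketch is reproducible
samples = rv.rvs(size=200_000, random_state=12345)
logs = np.log(samples)

log_mean = logs.mean()   # should be close to mu
log_std = logs.std()     # should be close to sigma
print("log-mean: %.3f  log-stddev: %.3f" % (log_mean, log_std))
```

Because the log-samples are normally distributed, far fewer samples are needed than for the raw mean and standard deviation.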

And another small edit: it looks like proper use of a search engine would have helped. We were not the first to be trapped by this:

https://stats.stackexchange.com/questions/33036/fitting-log-normal-distribution-in-r-vs-scipy
http://nbviewer.ipython.org/url/xweb.geos.ed.ac.uk/~jsteven5/blog/lognormal_distributions.ipynb
scipy, lognormal distribution - parameters

Another edit: now that I know how it behaves, I realize that the behavior is, in principle, documented. In the "notes" section we can read:

with shape parameter sigma and scale parameter exp(mu)

It is just really not obvious (neither of us appreciated the importance of this short sentence). I guess the reason we could not understand it is that the analytical expression shown in the notes section does not include loc and scale. I think this is worth a bug report or a documentation improvement.

Original answer:

Indeed, the shape parameter topic is not well-documented when looking into the docs page for a particular distribution. I recommend having a look at the main stats documentation -- there is a section on shape parameters:

http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html#shape-parameters

It looks like there should be a lognorm.shapes property, telling you about what the s parameter means, specifically.

Edit: There is only one parameter, indeed:

>>> lognorm.shapes
's'

When comparing the general definition of the lognormal distribution (from Wikipedia):

f(x) = 1 / (σ*x*sqrt(2*pi)) * exp(-(ln(x) - μ)**2 / (2*σ**2))

and the formula given by the scipy docs:

lognorm.pdf(x, s) = 1 / (s*x*sqrt(2*pi)) * exp(-1/2*(log(x)/s)**2)

it becomes obvious that s is the true σ (sigma).
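The quoted formula can be compared against SciPy numerically. The sketch below uses the shape value from the question; the evaluation points are arbitrary.

```python
import numpy as np
from scipy.stats import lognorm

s = 0.953682269606                        # shape value from the question
x = np.array([0.1, 0.5, 1.0, 2.0, 5.0])   # arbitrary positive points

# formula quoted from the SciPy docs, written out with NumPy
manual = 1 / (s * x * np.sqrt(2 * np.pi)) * np.exp(-0.5 * (np.log(x) / s) ** 2)

print(np.allclose(manual, lognorm.pdf(x, s)))
```

This is the general definition with μ = 0, which is why s takes the place of σ.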

However, from the docs it is not obvious how the loc parameter is related to μ (mu).

It could be as in ln(x-loc), which would not correspond to μ in the general formula, or it could be ln(x)-loc, which would ensure correspondence between loc and μ. Try it out! :)

Edit 2

I compared the output of WolframAlpha (WA) with that of SciPy. WA clearly uses μ and σ as generally understood (as defined in the linked Wikipedia article).

>>> l = lognorm(s=2, loc=0)
>>> print("mean: %.5f  stddev: %.5f" % (l.mean(), l.std()))
mean: 7.38906  stddev: 54.09584

This matches WA's output.

Now, for loc not being zero, there is a mismatch. Example:

>>> l = lognorm(s=2, loc=1)
>>> print("mean: %.5f  stddev: %.5f" % (l.mean(), l.std()))
mean: 8.38906  stddev: 54.09584

For μ=1 and σ=2, WA gives a mean of 20.08 and a standard deviation of 147. There you have it: loc does not correspond to μ in the classical definition of the lognormal distribution.
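For comparison, here is a small sketch showing that loc only shifts the distribution, whereas passing scale=exp(1) yields the μ=1, σ=2 case. The variable names are illustrative.

```python
import math
from scipy.stats import lognorm

base = lognorm(s=2, loc=0)
shifted = lognorm(s=2, loc=1)               # loc moves the whole distribution by 1
scaled = lognorm(s=2, scale=math.exp(1))    # mu = 1 enters through scale

mean_shift = shifted.mean() - base.mean()   # equals loc; the spread is unchanged
print("shift in mean: %.5f" % mean_shift)
print("mean: %.2f  stddev: %.2f" % (scaled.mean(), scaled.std()))
```

The second line printed can be compared with the WA values for μ=1, σ=2.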

answered Oct 20 '22 11:10 by Dr. Jan-Philip Gehrcke