
What is the difference between the scipy.stats module and the numpy.random module, for the similar methods that both modules have?

I was going over some distribution functions in Python:

Uniform, Binomial, Bernoulli, normal distributions

I found that pretty much the same functions are present in both scipy and numpy.

>>> from scipy.stats import binom
>>> n, p = 10, 0.5  # example values
>>> rv = binom(n, p)

>>> import numpy as np
>>> s = np.random.binomial(n, p, 1000)

Going over the code, I found that scipy uses numpy internally:

https://github.com/scipy/scipy/blob/master/scipy/stats/_discrete_distns.py

https://github.com/numpy/numpy/blob/master/numpy/random/mtrand/distributions.c
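One place where this relationship shows up in the public API is the random_state argument of rvs, which accepts a numpy RandomState. A minimal sketch, assuming illustrative values for n and p and an arbitrary seed of 42:

```python
import numpy as np
from scipy.stats import binom

n, p = 10, 0.5  # illustrative values (an assumption, not from the question)
rv = binom(n, p)

# rvs draws realizations; passing a numpy RandomState controls the stream.
a = rv.rvs(size=5, random_state=np.random.RandomState(42))
b = rv.rvs(size=5, random_state=np.random.RandomState(42))

# Two generators seeded identically yield identical draws.
print(a, b)
```

The same seed gives the same draws, which suggests that the scipy object hands the actual number generation to the numpy generator it is given.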

So, my question is: what is the primary motive for having two copies of the same distribution functions?

What additional functionality does the scipy library provide that is not in numpy?

Complete list of methods in each module is here:

Numpy Random module: https://docs.scipy.org/doc/numpy/reference/routines.random.html

Scipy stats module: https://docs.scipy.org/doc/scipy/reference/stats.html

I found a reference to some basic differences between the two modules: Difference between random draws from scipy.stats....rvs and numpy.random

Vikash Singh asked Jun 29 '17 06:06


2 Answers

scipy gives you a random variable, while numpy gives you random numbers. What np.random.binomial(n, p, 1) returns is just a realization of the random variable binom(n, p):

In probability and statistics, a realization, or observed value, of a random variable is the value that is actually observed (what actually happened). The random variable itself is the process dictating how the observation comes about. Statistical quantities computed from realizations without deploying a statistical model are often called "empirical", as in empirical distribution function or empirical probability.

In general, what numpy does is roll a die several times. scipy, on the other hand, tells you the probability of getting two sixes in a row, or the expected number of tails if you flip a coin a hundred times.

Of course, you can run a simulation in numpy and approximate these values (flip a coin one million times and the number of tails will be approximately 500 thousand). However, this is just the result of an experiment. A random variable gives you the theoretical answer: for the binomial, the expected value is n times p, where n is the number of trials and p is the probability of success, so the expected number of tails is exactly 500 thousand.
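As a minimal sketch of that contrast, here are the die and coin examples computed both ways. The seed and the simulation size of 100000 are arbitrary choices, not part of the original answer:

```python
import numpy as np
import scipy.stats as ss

# Exact answers from the random variables; nothing is sampled.
two_sixes = ss.binom(2, 1 / 6).pmf(2)        # P(two sixes in two rolls) = 1/36
expected_tails = ss.binom(100, 0.5).mean()   # n * p = 50

# Approximate answers from simulated realizations.
prng = np.random.RandomState(0)                 # arbitrary seed
rolls = prng.randint(1, 7, size=(100000, 2))    # 100000 pairs of die rolls
simulated_two_sixes = np.mean((rolls == 6).all(axis=1))
simulated_tails = prng.binomial(100, 0.5, size=100000).mean()

print(two_sixes, simulated_two_sixes)
print(expected_tails, simulated_tails)
```

The scipy values are the same on every run; the numpy values change with the seed and only get close to them.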


Here is a little demo:

import scipy.stats as ss
import numpy as np

n, p = 10**4, 0.3
rv  = ss.binom(n, p)

Get the mean and the standard deviation of the random variable:

rv.mean()
Out: 3000.0

rv.std()
Out: 45.825756949558397

Generate 100 random numbers from that distribution:

prng = np.random.RandomState(0)    
random_numbers = prng.binomial(n, p, size=100)

Calculate the mean and the standard deviation:

random_numbers.mean()
Out: 3004.8099999999999
random_numbers.std()
Out: 47.336813369723146

Generate another 100:

prng = np.random.RandomState(1)
random_numbers = prng.binomial(n, p, size=100)

Different mean and standard deviation:

random_numbers.mean()
Out: 2990.96

random_numbers.std()
Out: 46.245631145006548

As you increase the sample size, the sample mean and standard deviation approach the distribution's mean and standard deviation:

random_numbers = prng.binomial(n, p, size=10**7)

random_numbers.mean()
Out: 2999.9639155

random_numbers.std()
Out: 45.854409513250303
ayhan answered Oct 11 '22 07:10


what additional functionality is provided by scipy library that is not there in numpy?

You can see the additional functionality if you look at the documentation for one of the individual distributions (e.g., beta). The numpy functions only allow drawing random values. The scipy distributions have lots of extra methods for other things, like percentiles, cumulative distribution function, and statistics like the mean and standard deviation.
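For example, a rough sketch of a few of those extra methods on a frozen beta distribution. The shape parameters a=2 and b=5 are chosen only for illustration:

```python
import scipy.stats as ss

rv = ss.beta(2, 5)  # illustrative shape parameters a=2, b=5

mean = rv.mean()               # a / (a + b)
std = rv.std()                 # standard deviation
cdf_at_half = rv.cdf(0.5)      # P(X <= 0.5)
median = rv.ppf(0.5)           # percentile function (inverse of the CDF)
density = rv.pdf(0.2)          # probability density at x = 0.2
low, high = rv.interval(0.95)  # central interval holding 95% of the mass

print(mean, std, cdf_at_half, median, density, (low, high))
```

None of these calls draws a random value; the only sampling method on the object is rvs.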

Some of the information that scipy gives you is not computable directly from the numpy functions. The numpy functions only give you individual randomly-drawn values, but scipy represents the distribution mathematically and can compute some things without actually drawing any values. For instance, many of the stats that the scipy distributions return are computed with exact mathematical formulas. You can see in the source you linked to that, e.g., binom_gen._stats computes the mean, stdev, etc. directly. To find the mean using numpy you'd have to draw a bunch of values (theoretically an infinite number) and compute their mean; scipy does it abstractly without drawing any values. The scipy distributions expose mathematical details of the distributions that aren't available through numpy.
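A small sketch of that difference for the binomial case, assuming n = 10**4 and p = 0.3 as example values and an arbitrary seed and sample size for the numpy side:

```python
import numpy as np
import scipy.stats as ss

n, p = 10**4, 0.3  # example values (an assumption)

# Exact moments from closed-form formulas; no values are drawn.
mean, var = ss.binom.stats(n, p, moments="mv")
print(float(mean), float(var))   # n*p and n*p*(1-p)

# With numpy alone, the mean can only be estimated from draws.
prng = np.random.RandomState(0)  # arbitrary seed
draws = prng.binomial(n, p, size=1000)
sample_mean = draws.mean()
print(sample_mean)               # close to n*p, but not equal in general
```

The estimate tightens as the number of draws grows, whereas the scipy result does not depend on any sample.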

BrenBarn answered Oct 11 '22 06:10