How come "median" is 2x faster than "mean" using statistics package?

This surprises me... To illustrate, I've used this small script to calculate the mean and median of 1M random numbers:

import numpy as np
import statistics as st
import time

# 1M uniform random floats in [0, 1)
listofrandnum = np.random.rand(1000000,)

# time statistics.mean
t = time.time()
print('mean is:', st.mean(listofrandnum))
print('time to calc mean:', time.time()-t)

print('\n')

# time statistics.median
t = time.time()
print('median is:', st.median(listofrandnum))
print('time to calc median:', time.time()-t)

For which the result is:

mean is: 0.499866595037
time to calc mean: 2.0767598152160645


median is: 0.499721597395
time to calc median: 0.9687695503234863

My question: how come mean is slower than median? Median needs a sorting algorithm (i.e., comparisons), while mean only requires summing. Does it make sense that summing would be slower than comparing?

I would appreciate your insight into this.

— asked by Aguy on Jan 06 '23


1 Answer

statistics isn't part of NumPy. It's a Python standard library module with a rather different design philosophy: it goes for accuracy at all costs, even for unusual input datatypes and extremely poorly-conditioned input. statistics.mean routes every element through exact rational arithmetic, converting each value to an exact integer ratio and accumulating per-denominator partial sums, while statistics.median merely calls sorted() and picks out the middle element; Python's sort runs its comparisons at C level, so on this input the "harder" operation wins. Performing a sum the way the statistics module does it is really expensive, more so than performing a sort.
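To get a feel for where the time goes, here is a minimal sketch (not the module's actual code; timings are machine-dependent) that mimics the per-denominator bucketing strategy of the _sum source quoted below and compares it against the builtin sum:

import time
from fractions import Fraction
import numpy as np

data = np.random.rand(1000000).tolist()  # plain Python floats

# Builtin sum: C-level float addition, accepts round-off error.
t = time.time()
print('builtin sum:', sum(data), 'in', time.time() - t, 's')

# Exact sum: convert each float to an exact integer ratio and keep one
# integer partial sum per denominator, roughly what statistics._sum does.
t = time.time()
partials = {}
for x in data:
    n, d = x.as_integer_ratio()
    partials[d] = partials.get(d, 0) + n
total = sum(Fraction(n, d) for d, n in partials.items())
print('exact sum:', float(total), 'in', time.time() - t, 's')

The exact version has to do Python-level work per element, and that per-element overhead is what you are measuring in statistics.mean.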

If you want an efficient mean or median on NumPy arrays, use the NumPy routines:

numpy.mean(whatever)
numpy.median(whatever)
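For comparison, a quick sketch along the lines of the original benchmark (timings will vary by machine):

import time
import numpy as np

arr = np.random.rand(1000000)

# numpy.mean: a vectorized C-level summation
t = time.time()
print('numpy mean:', np.mean(arr), 'in', time.time() - t, 's')

# numpy.median: partition-based selection rather than a full sort
t = time.time()
print('numpy median:', np.median(arr), 'in', time.time() - t, 's')

Both should be dramatically faster than their statistics counterparts, and because numpy.median uses introselect-based partitioning instead of a full sort, the usual intuition (mean cheaper than median) is restored.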

If you want to see the expensive work the statistics module goes through for a simple sum, you can look at the source code. Note how every value is converted to an exact integer ratio, partial sums are kept per denominator, and the final total is an exact Fraction:

def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.


    Examples
    --------

    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)

    Some sources of round-off error will be avoided:

    >>> _sum([1e50, 1, -1e50] * 1000)  # Built-in sum returns zero.
    (<class 'float'>, Fraction(1000, 1), 3000)

    Fractions and Decimals are also supported:

    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)

    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)

    Mixed types are currently treated as an error, except that int is
    allowed.
    """
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n,d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)
— answered by user2357112 supports Monica on Jan 17 '23