How come "median" is 2x faster than "mean" using statistics package?

This surprises me... To illustrate, I've used this small script to calculate the mean and median of 1M random numbers:

import numpy as np
import statistics as st
import time

# 1M uniform random floats in [0, 1)
listofrandnum = np.random.rand(1000000,)

# time statistics.mean
t = time.time()
print('mean is:', st.mean(listofrandnum))
print('time to calc mean:', time.time()-t)

print('\n')

# time statistics.median
t = time.time()
print('median is:', st.median(listofrandnum))
print('time to calc median:', time.time()-t)

For which the result is:

mean is: 0.499866595037
time to calc mean: 2.0767598152160645


median is: 0.499721597395
time to calc median: 0.9687695503234863

My question: how come mean is slower than median? Median needs a sorting algorithm (i.e., comparisons), while mean only requires summing. Does it make sense that summing would be slower than comparing?

I would appreciate your insight into this.

— asked by Aguy on Jan 06 '23


1 Answer

statistics isn't part of NumPy. It's a Python standard library module with a rather different design philosophy: it goes for accuracy at all costs, even for unusual input datatypes and extremely poorly-conditioned input. statistics.mean routes every element through exact rational arithmetic, converting each value to an exact integer ratio and accumulating per-denominator partial sums, while statistics.median merely calls sorted() and picks out the middle element; Python's sort runs its comparisons at C level, so on this input the "harder" operation wins. Performing a sum the way the statistics module does it is really expensive, more so than performing a sort.
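To get a feel for where the time goes, here is a minimal sketch (not the module's actual code; timings are machine-dependent) that mimics the per-denominator bucketing strategy of the _sum source quoted below and compares it against the builtin sum:

import time
from fractions import Fraction
import numpy as np

data = np.random.rand(1000000).tolist()  # plain Python floats

# Builtin sum: C-level float addition, accepts round-off error.
t = time.time()
print('builtin sum:', sum(data), 'in', time.time() - t, 's')

# Exact sum: convert each float to an exact integer ratio and keep one
# integer partial sum per denominator, roughly what statistics._sum does.
t = time.time()
partials = {}
for x in data:
    n, d = x.as_integer_ratio()
    partials[d] = partials.get(d, 0) + n
total = sum(Fraction(n, d) for d, n in partials.items())
print('exact sum:', float(total), 'in', time.time() - t, 's')

The exact version has to do Python-level work per element, and that per-element overhead is what you are measuring in statistics.mean.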

If you want an efficient mean or median on NumPy arrays, use the NumPy routines:

numpy.mean(whatever)
numpy.median(whatever)
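For comparison, a quick sketch along the lines of the original benchmark (timings will vary by machine):

import time
import numpy as np

arr = np.random.rand(1000000)

# numpy.mean: a vectorized C-level summation
t = time.time()
print('numpy mean:', np.mean(arr), 'in', time.time() - t, 's')

# numpy.median: partition-based selection rather than a full sort
t = time.time()
print('numpy median:', np.median(arr), 'in', time.time() - t, 's')

Both should be dramatically faster than their statistics counterparts, and because numpy.median uses introselect-based partitioning instead of a full sort, the usual intuition (mean cheaper than median) is restored.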

If you want to see the expensive work the statistics module goes through for a simple sum, you can look at the source code. Note how every value is converted to an exact integer ratio, partial sums are kept per denominator, and the final total is an exact Fraction:

def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.


    Examples
    --------

    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)

    Some sources of round-off error will be avoided:

    >>> _sum([1e50, 1, -1e50] * 1000)  # Built-in sum returns zero.
    (<class 'float'>, Fraction(1000, 1), 3000)

    Fractions and Decimals are also supported:

    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)

    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)

    Mixed types are currently treated as an error, except that int is
    allowed.
    """
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n,d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)
— answered by user2357112 supports Monica on Jan 17 '23