This surprises me... To illustrate, I've used this small script to calculate the mean and median of 1M random numbers:
import numpy as np
import statistics as st
import time

listofrandnum = np.random.rand(1000000,)  # 1M uniform random floats

t = time.time()
print('mean is:', st.mean(listofrandnum))
print('time to calc mean:', time.time() - t)
print('\n')

t = time.time()
print('median is:', st.median(listofrandnum))
print('time to calc median:', time.time() - t)
The result is:
mean is: 0.499866595037
time to calc mean: 2.0767598152160645
median is: 0.499721597395
time to calc median: 0.9687695503234863
My question: how come mean is slower than median? median needs some sorting algorithm (i.e. comparisons), while mean only requires summing. Does it make sense that a sum would be slower than comparisons?
I'd appreciate your insight into this.
statistics isn't part of NumPy. It's a Python standard library module with a rather different design philosophy: it goes for accuracy at all costs, even for unusual input datatypes and extremely poorly-conditioned input. Performing a sum the way the statistics module does it is really expensive, more so than performing a sort. (statistics.median, for its part, just sorts the data once with the built-in sorted(), whose comparisons run at C speed.)
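To get a feel for that cost, here is a rough sketch of the dominant work (not the actual _sum code, which batches numerators by denominator, as shown further below): every element is converted to an exact integer ratio and accumulated with exact rational arithmetic in a Python-level loop, while NumPy's sum is a single vectorized pass over a C array.

import time
from fractions import Fraction

import numpy as np

data = np.random.rand(1000000,)

# Roughly what statistics.mean pays per element: an exact
# integer-ratio conversion plus exact rational accumulation.
t = time.time()
total = sum(Fraction(*float(x).as_integer_ratio()) for x in data)
print('exact mean:', float(total / len(data)))
print('time:', time.time() - t)

# What numpy.mean pays: one float pass in compiled code.
t = time.time()
print('float mean:', data.sum() / len(data))
print('time:', time.time() - t)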
If you want an efficient mean or median on NumPy arrays, use the NumPy routines:
numpy.mean(whatever)
numpy.median(whatever)
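For reference, here is the same benchmark rewritten against the NumPy routines (exact timings will of course vary by machine); note that numpy.median doesn't even fully sort the data, it uses a partition-based selection via numpy.partition:

import time
import numpy as np

listofrandnum = np.random.rand(1000000,)

t = time.time()
print('numpy mean is:', np.mean(listofrandnum))
print('time to calc mean:', time.time() - t)

t = time.time()
print('numpy median is:', np.median(listofrandnum))
print('time to calc median:', time.time() - t)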
If you want to see the expensive work the statistics module goes through for a simple sum, you can look at the source code of its internal _sum helper:
def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.

    Examples
    --------
    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)

    Some sources of round-off error will be avoided:

    >>> _sum([1e50, 1, -1e50] * 1000)  # Built-in sum returns zero.
    (<class 'float'>, Fraction(1000, 1), 3000)

    Fractions and Decimals are also supported:

    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)

    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)

    Mixed types are currently treated as an error, except that int is
    allowed.
    """
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n,d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)
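As a quick check of what that exact accumulation buys you, the cancellation example from the docstring can be reproduced through the public API:

import statistics

data = [1e50, 1, -1e50] * 1000

# Built-in float sum loses the 1s to round-off entirely.
print(sum(data))              # 0.0

# statistics.mean sums exactly: 1000 over 3000 items.
print(statistics.mean(data))  # 0.3333333333333333

That exactness is exactly the per-element work that makes statistics.mean slower than a single sort.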