
Calculate mean and median efficiently

What is the most efficient way to sequentially find the mean and median of rows in a Python list?

For example, my list:

input_list = [1,2,4,6,7,8]

I want to produce an output list that contains:

output_list_mean = [1,1.5,2.3,3.25,4,4.7]
output_list_median = [1,1.5,2.0,3.0,4.0,5.0]

Where the mean is calculated as follows:

  • 1 = mean(1)
  • 1.5 = mean(1,2) (i.e. mean of first 2 values in input_list)
  • 2.3 = mean(1,2,4) (i.e. mean of first 3 values in input_list)
  • 3.25 = mean(1,2,4,6) (i.e. mean of first 4 values in input_list) etc.

And the median is calculated as follows:

  • 1 = median(1)
  • 1.5 = median(1,2) (i.e. median of first 2 values in input_list)
  • 2.0 = median(1,2,4) (i.e. median of first 3 values in input_list)
  • 3.0 = median(1,2,4,6) (i.e. median of first 4 values in input_list) etc.

I have tried to implement it with the following loop, but it seems very inefficient.

import numpy

input_list = [1,2,4,6,7,8]

for item in range(1,len(input_list)+1):
    print(numpy.mean(input_list[:item]))
    print(numpy.median(input_list[:item]))

hoof_hearted asked Jul 12 '15 16:07


2 Answers

Anything you write yourself, especially for the median, will either take a lot of work or be very inefficient. Pandas ships efficient built-in implementations of both functions: the expanding mean is O(n), and the expanding median is O(n*log(n)), implemented with a skip list:

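import pandas as pd

input_list = [1, 2, 4, 6, 7, 8]

# pd.expanding_mean / pd.expanding_median were removed in pandas 0.23;
# the .expanding() accessor on a Series replaces them.
>>> pd.Series(input_list).expanding().mean().to_numpy()
array([1.        , 1.5       , 2.33333333, 3.25      , 4.        , 4.66666667])

>>> pd.Series(input_list).expanding().median().to_numpy()
array([1. , 1.5, 2. , 3. , 4. , 5. ])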
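
For readers who cannot depend on Pandas, the O(n*log(n)) expanding median can also be sketched in plain Python. Note that this is not the skip-list approach mentioned above: it swaps in the two-heap technique, which has the same asymptotic cost. The function name running_medians is hypothetical and chosen only for this illustration.

```python
import heapq

def running_medians(values):
    """Return the median of every prefix of `values` (two-heap sketch)."""
    lower = []  # max-heap of the smaller half, stored as negated values
    upper = []  # min-heap of the larger half
    medians = []
    for x in values:
        # Push onto the lower half, then move its largest element up so that
        # every element of `lower` is <= every element of `upper`.
        heapq.heappush(lower, -x)
        heapq.heappush(upper, -heapq.heappop(lower))
        # Rebalance so `lower` holds the extra element when the count is odd.
        if len(upper) > len(lower):
            heapq.heappush(lower, -heapq.heappop(upper))
        if len(lower) > len(upper):
            medians.append(float(-lower[0]))
        else:
            medians.append((-lower[0] + upper[0]) / 2)
    return medians

print(running_medians([1, 2, 4, 6, 7, 8]))
# [1.0, 1.5, 2.0, 3.0, 4.0, 5.0]
```

Each new value costs O(log(n)) heap work, so the whole list is processed in O(n*log(n)) without re-sorting any prefix.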

Jaime answered Sep 18 '22 00:09


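You can use itertools.islice to take each prefix of your array, build it with np.fromiter, and average it with mean():

>>> import numpy as np
>>> from itertools import islice
>>> arr = np.array([1, 2, 4, 6, 7, 8])
>>> l = arr.size
>>> # range replaces Python 2's xrange; islice(arr, 0, i+1) yields the first i+1 items
>>> [np.fromiter(islice(arr, 0, i+1), float).mean(dtype=np.float32) for i in range(l)]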
[1.0, 1.5, 2.3333333, 3.25, 4.0, 4.6666665]

Alternatively, if you only want the mean, you can use np.cumsum to get the cumulative sum of the elements and divide it by the number of elements seen so far, using np.true_divide:

>>> np.true_divide(np.cumsum(arr), np.arange(1, arr.size + 1))
array([1.        , 1.5       , 2.33333333, 3.25      , 4.        , 4.66666667])
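
The same cumulative-sum idea can be sketched without NumPy, using itertools.accumulate from the standard library. The helper name running_means is hypothetical; the point is only that each running total is divided by the count of elements seen so far, not by the element itself.

```python
from itertools import accumulate

def running_means(values):
    """Return the mean of every prefix of `values` in a single pass."""
    # accumulate() yields the running totals; enumerate(..., start=1)
    # supplies the number of elements seen so far.
    return [total / count
            for count, total in enumerate(accumulate(values), start=1)]

print(running_means([1, 2, 4, 6, 7, 8]))
```

This runs in O(n), since every prefix mean reuses the previous running total instead of re-summing the slice.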

Mazdak answered Sep 22 '22 00:09