Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: compute average of n-th elements in list of lists with different lengths

Suppose I have the following list of lists:

a = [ 
      [1, 2, 3],
      [2, 3, 4],
      [3, 4, 5, 6] 
    ]

I want to have the average of each n-th element in the arrays. However, when wanting to do this in a simple way, Python generated out-of-bounds errors because of the different lengths. I solved this by giving each array the length of the longest array, and filling the missing values with None.

Unfortunately, doing this made it impossible to compute an average, so I converted the arrays into masked arrays. The code shown below works, but it seems rather cumbersome.

import numpy as np
import numpy.ma as ma

a = [ [1, 2, 3],
      [2, 3, 4],
      [3, 4, 5, 6] ]

# Determine the length of the longest list
lenlist = []
for i in a:
    lenlist.append(len(i))
max = np.amax(lenlist)

# Fill each list up with None's until required length is reached
for i in a:
    if len(i) <= max:
        for j in range(max - len(i)):
            i.append(None)

# Fill temp_array up with the n-th element
# and add it to temp_array 
temp_list = []
masked_arrays = []
for j in range(max):
    for i in range(len(a)):
        temp_list.append(a[i][j])
    masked_arrays.append(ma.masked_values(temp_list, None))
    del temp_list[:]

# Compute the average of each array 
avg_array = []
for i in masked_arrays:
    avg_array.append(np.ma.average(i))

print avg_array

Is there a way to do this more quickly? The final list of lists will contain 600000 'rows' and up to 100 'columns', so efficiency is quite important :-).

like image 992
Laurens Jansma Avatar asked Dec 06 '22 18:12

Laurens Jansma


2 Answers

tertools.izip_longest would do all the padding with None's for you so your code can be reduced to:

import numpy as np
import numpy.ma as ma
from itertools import izip_longest

a = [ [1, 2, 3],
      [2, 3, 4],
      [3, 4, 5, 6] ]


averages = [np.ma.average(ma.masked_values(temp_list, None)) for temp_list in izip_longest(*a)]

print(averages)
[2.0, 3.0, 4.0, 6.0]

No idea what the fastest way in regard to the numpy logic but this is definitely going to be a lot more efficient than your own code.

If you wanted a faster pure python solution:

from itertools import izip_longest, imap

a = [[1, 2, 3],
     [2, 3, 4],
     [3, 4, 5, 6]]


def avg(x):
    x = filter(None, x)
    return sum(x, 0.0) / len(x)


filt = imap(avg, izip_longest(*a))

print(list(filt))
[2.0, 3.0, 4.0, 6.0]

If you have 0's in the arrays that won't work as 0 will be treated as Falsey, you will have to use a list comp to filter in that case but it will still be faster:

def avg(x):
    x = [i for i in x if i is not None]
    return sum(x, 0.0) / len(x)

filt = imap(avg, izip_longest(*a))
like image 198
Padraic Cunningham Avatar answered Dec 08 '22 08:12

Padraic Cunningham


Here's an almost* fully vectorized solution based on np.bincount and np.cumsum -

# Store lengths of each list and their cumulative and entire summations
lens = np.array([len(i) for i in a]) # Only loop to get lengths
C = lens.cumsum()
N = lens.sum()

# Create ID array such that the first element of each list is 0, 
# the second element as 1 and so on. This is needed in such a format
# for use with bincount later on.
shifts_arr = np.ones(N,dtype=int)
shifts_arr[C[:-1]] = -lens[:-1]+1
id_arr = shifts_arr.cumsum()-1

# Use bincount to get the summations and thus the 
# averages across all lists based on their positions. 
avg_out = np.bincount(id_arr,np.concatenate(a))/np.bincount(id_arr)

-* Almost because we are getting the lengths of lists with a loop, but with minimal computation involved there, must not affect the total runtime hugely.

Sample run -

In [109]: a = [ [1, 2, 3],
     ...:       [2, 3, 4],
     ...:       [3, 4, 5, 6] ]

In [110]: lens = np.array([len(i) for i in a])
     ...: C = lens.cumsum()
     ...: N = lens.sum()
     ...: 
     ...: shifts_arr = np.ones(N,dtype=int)
     ...: shifts_arr[C[:-1]] = -lens[:-1]+1
     ...: id_arr = shifts_arr.cumsum()-1
     ...: 
     ...: avg_out = np.bincount(id_arr,np.concatenate(a))/np.bincount(id_arr)
     ...: 

In [111]: avg_out
Out[111]: array([ 2.,  3.,  4.,  6.])
like image 35
Divakar Avatar answered Dec 08 '22 08:12

Divakar