
How to split a numpy array and perform certain actions on the split arrays [Python]

Only part of this question has been asked before ([1][2]), and those answers explained how to split numpy arrays. I am quite new to Python. I have an array containing 262144 items and want to split it into small arrays of length 512, sort each one individually and sum its first five values, but I am unsure how to proceed beyond this line:

np.array_split(vector, 512)

How do I access and analyse each array? Would it be a good idea to keep using numpy arrays, or should I go back to using a dictionary instead?

Grayrigel asked Jan 29 '17 11:01


2 Answers

Splitting as such won't be an efficient solution. Instead, we could reshape, which effectively creates the subarrays as rows of a 2D array. The reshaped array is a view into the input array, so it needs no additional memory. Then we get the argsort indices, select the first five indices per row, and finally sum the selected values for the desired output.

Thus, we would have an implementation like so -

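import numpy as np

N = 512 # Number of elements in each split array
M = 5   # Number of smallest elements to sum in each split array

# a is the 1D input array; its length must be a multiple of N
b = a.reshape(-1,N)
# Pick the M smallest values per row via argsort indices, then sum each row
out = b[np.arange(b.shape[0])[:,None], b.argsort(1)[:,:M]].sum(1)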
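
As a minimal sketch, the same steps can be wrapped in a small function. The name sum_smallest_per_chunk is hypothetical, and the function assumes a 1D input whose length is a multiple of the chunk size. The np.shares_memory call only illustrates that the reshaped array is a view of the input in this case.

```python
import numpy as np

def sum_smallest_per_chunk(a, n, m):
    """Sum the m smallest values in each consecutive chunk of length n.

    Hypothetical helper; assumes a is 1D and len(a) is a multiple of n.
    """
    a = np.asarray(a)
    if a.size % n != 0:
        raise ValueError("array length must be a multiple of n")
    b = a.reshape(-1, n)                     # one chunk per row
    rows = np.arange(b.shape[0])[:, None]    # column vector of row indices
    smallest = b.argsort(axis=1)[:, :m]      # indices of the m smallest per row
    return b[rows, smallest].sum(axis=1)

# Small 14-element example, split into two chunks of 7
a = np.array([45, 19, 71, 53, 20, 33, 31, 20, 41, 19, 38, 31, 86, 34])
b = a.reshape(-1, 7)
shares = np.shares_memory(a, b)   # True here: no data was copied
result = sum_smallest_per_chunk(a, 7, 5)
print(shares, result)
```

The explicit length check is a design choice of this sketch: reshape would raise its own error for a mismatched length, but the message here points at the actual cause.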

Step-by-step sample run -

In [217]: a   # Input array
Out[217]: array([45, 19, 71, 53, 20, 33, 31, 20, 41, 19, 38, 31, 86, 34])

In [218]: N = 7 # 512 for original case, 7 for sample

In [219]: M = 5

# Reshape into a 2D array with N columns (one subarray per row)
In [220]: b = a.reshape(-1,N)

In [224]: b
Out[224]: 
array([[45, 19, 71, 53, 20, 33, 31],
       [20, 41, 19, 38, 31, 86, 34]])

# Get argsort indices per row
In [225]: b.argsort(1)
Out[225]: 
array([[1, 4, 6, 5, 0, 3, 2],
       [2, 0, 4, 6, 3, 1, 5]])

# Select first M ones
In [226]: b.argsort(1)[:,:M]
Out[226]: 
array([[1, 4, 6, 5, 0],
       [2, 0, 4, 6, 3]])

# Use fancy-indexing to select those M ones per row
In [227]: b[np.arange(b.shape[0])[:,None], b.argsort(1)[:,:M]]
Out[227]: 
array([[19, 20, 31, 33, 45],
       [19, 20, 31, 34, 38]])

# Finally sum along each row
In [228]: b[np.arange(b.shape[0])[:,None], b.argsort(1)[:,:M]].sum(1)
Out[228]: array([148, 142])

Performance boost with np.argpartition, which only partitions each row around its M-th smallest element instead of fully sorting it -

out = b[np.arange(b.shape[0])[:,None], np.argpartition(b,M,axis=1)[:,:M]].sum(1)
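
A quick way to convince yourself that the partial sort gives the same sums is to compare both versions on random data. The sketch below does that. It also includes a variant based on np.partition, which is not part of the original answer: it partitions the values themselves, so the fancy-indexing step is not needed. With ties, the selected indices can differ between methods, but the sums should still match.

```python
import numpy as np

rng = np.random.default_rng(0)            # seeded only for repeatability
a = rng.integers(11, 99, size=512 * 512)
N, M = 512, 5
b = a.reshape(-1, N)
rows = np.arange(b.shape[0])[:, None]

# Full sort of the indices, then keep the first M per row
via_argsort = b[rows, b.argsort(axis=1)[:, :M]].sum(axis=1)

# Partial sort of the indices around position M
via_argpartition = b[rows, np.argpartition(b, M, axis=1)[:, :M]].sum(axis=1)

# Assumed alternative: partition the values directly, no index lookup
via_partition = np.partition(b, M, axis=1)[:, :M].sum(axis=1)

assert np.array_equal(via_argsort, via_argpartition)
assert np.array_equal(via_argsort, via_partition)
```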

Runtime test -

In [236]: a = np.random.randint(11,99,(512*512))

In [237]: N = 512

In [238]: M = 5

In [239]: b = a.reshape(-1,N)

In [240]: %timeit b[np.arange(b.shape[0])[:,None], b.argsort(1)[:,:M]].sum(1)
100 loops, best of 3: 14.2 ms per loop

In [241]: %timeit b[np.arange(b.shape[0])[:,None], \
                np.argpartition(b,M,axis=1)[:,:M]].sum(1)
100 loops, best of 3: 3.57 ms per loop
Divakar answered Oct 04 '22 20:10


A more detailed, step-by-step version of what you want:

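import numpy as np

vector = np.random.rand(262144)

# array_split takes the number of sections (512), not the section length;
# with 262144 items each section happens to hold 512 items
splits = np.array_split(vector, 512)

sums = []
for split in splits:
    # sort it (in place; split is a view, so this also reorders vector)
    split.sort()
    # take the five smallest values
    subSplit = split[:5]
    # build sum
    splitSum = sum(subSplit)
    # add to new list
    sums.append(splitSum)

print(np.array(sums).shape)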

For the same input, this gives the same output as @Divakar's solution.
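
To check that claim, here is a rough sketch that runs the loop and a reshape-based version on the same data and compares the results. It uses np.sort, which returns a sorted copy, instead of split.sort(), so the input vector is left untouched. That swap is a choice made for this sketch, not something the answer above does.

```python
import numpy as np

rng = np.random.default_rng(0)         # seeded only for repeatability
vector = rng.random(262144)
original = vector.copy()

# 512 is the number of sections; each section has 512 items for this size
chunks = np.array_split(vector, 512)

# Loop version: sorted copy of each chunk, sum of its five smallest values
loop_sums = np.array([np.sort(chunk)[:5].sum() for chunk in chunks])

# Reshape version: sort every row at once, then sum the first five columns
vectorized = np.sort(vector.reshape(-1, 512), axis=1)[:, :5].sum(axis=1)

print(len(chunks), chunks[0].shape, loop_sums.shape)
print(np.allclose(loop_sums, vectorized))
```

np.allclose is used rather than exact equality because floating-point sums can differ in the last bits depending on how they are accumulated.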

ppasler answered Oct 04 '22 21:10