
Construct single numpy array from smaller arrays of different sizes

I have an array of values, x. Given 'start' and 'stop' indices, I need to construct an array y using sub-arrays of x.

import numpy as np
x = np.arange(20)
start = np.array([2, 8, 15])
stop = np.array([5, 10, 20])
nsubarray = len(start)

I would like y to be:

y = array([ 2,  3,  4,  8,  9, 15, 16, 17, 18, 19])

(In practice the arrays I am using are much larger).

One way to construct y is using a list comprehension, but the list needs to be flattened afterwards:

import itertools as it
y = [x[start[i]:stop[i]] for i in range(nsubarray)]
y = np.fromiter(it.chain.from_iterable(y), dtype=int)

I found that it is actually faster to use a for-loop:

y = np.empty(sum(stop - start), dtype = int)
a = 0
for i in range(nsubarray):
    b = a + stop[i] - start[i]
    y[a:b] = x[start[i]:stop[i]]
    a = b

I was wondering if anyone knows of a way that I can optimize this? Thank you very much!

EDIT

The following script sets up a larger test case and times each approach:

import numpy as np
import numpy.random as rd
import itertools as it


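def get_chunks(arr, start, stop):
    keep = stop > start    # drop zero-sized ranges from start and stop alike
    start, stop = start[keep], stop[keep]
    rng = np.cumsum(stop - start)    # output position just past each chunk
    inds = np.ones(rng[-1], dtype=int)    # step of 1 inside a chunk
    inds[rng[:-1]] = start[1:] - stop[:-1] + 1    # jump to the next chunk's start
    inds[0] = start[0]
    np.cumsum(inds, out=inds)    # running sum turns steps into indices
    return np.take(arr, inds)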


def for_loop(arr, start, stop):
    y = np.empty(sum(stop - start), dtype = int)
    a = 0
    for i in range(len(start)):    # one pass per (start, stop) pair
        b = a + stop[i] - start[i]
        y[a:b] = arr[start[i]:stop[i]]
        a = b
    return y

xmax = 1E6
nsubarray = 100000
x = np.arange(xmax)
start = rd.randint(0, xmax - 10, nsubarray)
stop = start + 10

Which results in:

In [379]: %timeit np.hstack([x[i:j] for i,j in it.izip(start, stop)])
1 loops, best of 3: 410 ms per loop

In [380]: %timeit for_loop(x, start, stop)
1 loops, best of 3: 281 ms per loop

In [381]: %timeit np.concatenate([x[i:j] for i,j in it.izip(start, stop)])
10 loops, best of 3: 97.8 ms per loop

In [382]: %timeit get_chunks(x, start, stop)
100 loops, best of 3: 16.6 ms per loop
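
The timings above come from an IPython session under Python 2 (`it.izip`). As a rough sketch of how the comparison could be repeated as a plain Python 3 script, the block below uses the standard `timeit` module. The helper names (`concat_slices`, `hstack_slices`, `best_time`) and the smaller problem size are assumptions made for illustration, and no timings are claimed here.

```python
import timeit
import numpy as np

def concat_slices(arr, start, stop):
    # Slice in Python, then join all pieces with a single NumPy call.
    return np.concatenate([arr[i:j] for i, j in zip(start, stop)])

def hstack_slices(arr, start, stop):
    # Same slices, joined with np.hstack instead.
    return np.hstack([arr[i:j] for i, j in zip(start, stop)])

def best_time(func, *args, repeat=3, number=1):
    # Best wall-clock time per call, in seconds, over `repeat` runs.
    runs = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(runs) / number

rng = np.random.default_rng(0)      # seeded so repeated runs use the same data
xmax, nsubarray = 10**5, 10**4      # assumed sizes, smaller than the question's
x = np.arange(xmax)
start = rng.integers(0, xmax - 10, nsubarray)
stop = start + 10

timings = {f.__name__: best_time(f, x, start, stop)
           for f in (concat_slices, hstack_slices)}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms")
```

Any other candidate with the same `(arr, start, stop)` signature can be added to the tuple of functions being timed.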
turnerm asked Mar 11 '14 13:03



1 Answer

This is a bit complicated, but quite fast. The idea is to build the full index array with vectorized arithmetic (two cumulative sums) and then use np.take, avoiding Python loops entirely:

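def get_chunks(arr, start, stop):
    keep = stop > start    # drop zero-sized ranges from start and stop alike
    start, stop = start[keep], stop[keep]
    rng = np.cumsum(stop - start)    # output position just past each chunk
    inds = np.ones(rng[-1], dtype=int)    # step of 1 inside a chunk
    inds[rng[:-1]] = start[1:] - stop[:-1] + 1    # jump to the next chunk's start
    inds[0] = start[0]
    np.cumsum(inds, out=inds)    # running sum turns steps into indices
    return np.take(arr, inds)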
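
To make the two cumulative sums less opaque, here is a small trace using the arrays from the question. It is only an illustration of the idea; the intermediate names `ends` and `steps` are introduced here and do not appear in the function above.

```python
import numpy as np

x = np.arange(20)
start = np.array([2, 8, 15])
stop = np.array([5, 10, 20])

# Output position just past the end of each chunk.
ends = np.cumsum(stop - start)                  # [3, 5, 10]

# Differences between consecutive source indices: 1 inside a chunk...
steps = np.ones(ends[-1], dtype=int)
# ...and, at each chunk boundary, the jump from the previous chunk's last
# element (stop - 1) to the next chunk's first element (start).
steps[ends[:-1]] = start[1:] - stop[:-1] + 1    # jumps of 4 and 6
steps[0] = start[0]                             # the first index is start[0]
# steps is now [2, 1, 1, 4, 1, 6, 1, 1, 1, 1]

# A running sum of the differences recovers the indices themselves.
inds = np.cumsum(steps)
y = x[inds]
print(y)  # [ 2  3  4  8  9 15 16 17 18 19]
```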

Check that it is returning the correct result:

xmax = 1E6
nsubarray = 100000
x = np.arange(xmax)
start = np.random.randint(0, xmax - 10, nsubarray)
stop = start + np.random.randint(1, 10, nsubarray)

old = np.concatenate([x[b:e] for b, e in zip(start, stop)])
new = get_chunks(x, start, stop)
np.allclose(old,new)
True
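
Note that the random test data above only produces ranges of length 1 to 9, so empty ranges are never exercised. As a hedged alternative that handles them without a separate filtering step, the index array can also be built with np.repeat. The function name `get_chunks_repeat` is hypothetical, and this is a sketch of a different technique rather than the method timed in this answer.

```python
import numpy as np

def get_chunks_repeat(arr, start, stop):
    # Hypothetical variant: build the index array with np.repeat.
    lengths = np.maximum(stop - start, 0)    # treat stop <= start as empty
    total = int(lengths.sum())
    # Output position where each chunk begins: [0, len0, len0 + len1, ...].
    out_starts = np.cumsum(lengths) - lengths
    # Per-element offset from output position to source index.
    offsets = np.repeat(start - out_starts, lengths)
    return arr[np.arange(total) + offsets]

x = np.arange(20)
start = np.array([2, 8, 8, 15])
stop = np.array([5, 8, 10, 20])    # the second range is empty
expected = np.concatenate([x[b:e] for b, e in zip(start, stop)])
result = get_chunks_repeat(x, start, stop)
assert np.array_equal(result, expected)
```

Empty ranges simply get a repeat count of zero, so they contribute nothing to the index array.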

Some timings:

%timeit np.hstack([x[i:j] for i,j in zip(start, stop)])
1 loops, best of 3: 354 ms per loop

%timeit np.concatenate([x[b:e] for b, e in izip(start, stop)])
10 loops, best of 3: 119 ms per loop

%timeit get_chunks(x, start, stop)
100 loops, best of 3: 7.59 ms per loop
Daniel answered Sep 20 '22 22:09