Determine blocks in sorted numpy integer array

Question

I have a sorted integer array, e.g., [0, 0, 1, 1, 1, 2, 4, 4], and I would like to determine where the integer blocks start and how long the blocks are. The block sizes are small but the array itself can be very large, so efficiency is important. The total number of blocks is also known.

numpy.unique does the trick:

import numpy


a = numpy.array([0, 0, 1, 1, 1, 2, 4, 4])
num_blocks = 4
print(a)

_, idx_start, count = numpy.unique(a, return_index=True, return_counts=True)

print(idx_start)
print(count)

[0 0 1 1 1 2 4 4]
[0 2 5 6]
[2 3 1 2]

but is slow. I would assume that, given the specific structure of the input array, there's a more efficient solution.

For example, something as simple as

import numpy

a = numpy.array([0, 0, 1, 1, 1, 2, 3, 3])
num_blocks = 4


k = 0
z = a[k]
block_idx = 0
counts = numpy.empty(num_blocks, dtype=int)
count = 0
while k < len(a):
    if z == a[k]:
        count += 1
    else:
        z = a[k]
        counts[block_idx] = count
        count = 1
        block_idx += 1
    k += 1
counts[block_idx] = count

print(counts)

gives the block sizes, and a simple numpy.cumsum would give index_start. Using a Python loop is slow of course.

Any hints?

Divakar · Accepted Answer

Here's one with some masking and slicing -

def grp_start_len(a):
    m = np.r_[True,a[:-1] != a[1:],True] #np.concatenate for a bit more boost
    idx = np.flatnonzero(m)
    return idx[:-1], np.diff(idx)

Sample run -

In [18]: a
Out[18]: array([0, 0, 1, 1, 1, 2, 4, 4])

In [19]: grp_start_len(a)
Out[19]: (array([0, 2, 5, 6]), array([2, 3, 1, 2]))

Timings (setup from @AGN Gazer's solution) -

In [24]: np.random.seed(0)

In [25]: a = np.sort(np.random.randint(1, 10000, 10000))

In [26]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
1000 loops, best of 3: 411 µs per loop

# @AGN Gazer's solution
In [27]: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
10000 loops, best of 3: 81.2 µs per loop

In [28]: %timeit grp_start_len(a)
10000 loops, best of 3: 60.1 µs per loop

Bumping up the sizes 10x more -

In [40]: np.random.seed(0)

In [41]: a = np.sort(np.random.randint(1, 100000, 100000))

In [42]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
    ...: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
    ...: %timeit grp_start_len(a)
100 loops, best of 3: 5.34 ms per loop
1000 loops, best of 3: 792 µs per loop
1000 loops, best of 3: 463 µs per loop

AGN Gazer · Answer

np.where(np.ediff1d(a, None, a[0]))[0]

If you want to have the first "0" as in your answer, add a non-zero number to a[0]:

np.where(np.ediff1d(a, None, a[0] + 1))[0]

EDIT (Block length):

Ah, just noticed that you also want to get block length. Then, modify the above code:

st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]
idx = st[:-1]
cnt = np.ediff1d(st)

Then,

>>> print(idx)
[0 2 5 6]
>>> print(cnt)
[2 3 1 2]

EDIT 2 (Timing tests)

In [69]: a = np.sort(np.random.randint(1, 10000, 10000))

In [70]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
240 µs ± 7.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [71]: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
74.3 µs ± 816 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Determine blocks in sorted numpy integer array

Tags:

python

arrays

numpy

Nico Schlömer

2 Answers

Divakar

EDIT (Block length):

EDIT 2 (Timing tests)

AGN Gazer

Recent Activity

Donate For Us

Determine blocks in sorted numpy integer array

Tags:

python

arrays

numpy

Nico Schlömer

2 Answers

Divakar

EDIT (Block length):

EDIT 2 (Timing tests)

AGN Gazer

Related questions

Recent Activity

Donate For Us