I have a sorted integer array, e.g., [0, 0, 1, 1, 1, 2, 4, 4]
, and I would like to determine where the integer blocks start and how long the blocks are. The block sizes are small but the array itself can be very large, so efficiency is important. The total number of blocks is also known.
numpy.unique
does the trick:
import numpy
a = numpy.array([0, 0, 1, 1, 1, 2, 4, 4])
num_blocks = 4
print(a)
_, idx_start, count = numpy.unique(a, return_index=True, return_counts=True)
print(idx_start)
print(count)
[0 0 1 1 1 2 4 4]
[0 2 5 6]
[2 3 1 2]
but is slow. I would assume that, given the specific structure of the input array, there's a more efficient solution.
For example, something as simple as
import numpy
a = numpy.array([0, 0, 1, 1, 1, 2, 3, 3])
num_blocks = 4
k = 0
z = a[k]
block_idx = 0
counts = numpy.empty(num_blocks, dtype=int)
count = 0
while k < len(a):
if z == a[k]:
count += 1
else:
z = a[k]
counts[block_idx] = count
count = 1
block_idx += 1
k += 1
counts[block_idx] = count
print(counts)
gives the block sizes, and a simple numpy.cumsum
would give index_start
. Using a Python loop is slow of course.
Any hints?
Here's one with some masking and slicing -
def grp_start_len(a):
m = np.r_[True,a[:-1] != a[1:],True] #np.concatenate for a bit more boost
idx = np.flatnonzero(m)
return idx[:-1], np.diff(idx)
Sample run -
In [18]: a
Out[18]: array([0, 0, 1, 1, 1, 2, 4, 4])
In [19]: grp_start_len(a)
Out[19]: (array([0, 2, 5, 6]), array([2, 3, 1, 2]))
Timings (setup from @AGN Gazer's solution) -
In [24]: np.random.seed(0)
In [25]: a = np.sort(np.random.randint(1, 10000, 10000))
In [26]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
1000 loops, best of 3: 411 µs per loop
# @AGN Gazer's solution
In [27]: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
10000 loops, best of 3: 81.2 µs per loop
In [28]: %timeit grp_start_len(a)
10000 loops, best of 3: 60.1 µs per loop
Bumping up the sizes 10x
more -
In [40]: np.random.seed(0)
In [41]: a = np.sort(np.random.randint(1, 100000, 100000))
In [42]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
...: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
...: %timeit grp_start_len(a)
100 loops, best of 3: 5.34 ms per loop
1000 loops, best of 3: 792 µs per loop
1000 loops, best of 3: 463 µs per loop
np.where(np.ediff1d(a, None, a[0]))[0]
If you want to have the first "0" as in your answer, add a non-zero number to a[0]
:
np.where(np.ediff1d(a, None, a[0] + 1))[0]
Ah, just noticed that you also want to get block length. Then, modify the above code:
st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]
idx = st[:-1]
cnt = np.ediff1d(st)
Then,
>>> print(idx)
[0 2 5 6]
>>> print(cnt)
[2 3 1 2]
In [69]: a = np.sort(np.random.randint(1, 10000, 10000))
In [70]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
240 µs ± 7.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [71]: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
74.3 µs ± 816 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With