Grouping contiguous values in a numpy array with their lengths

In numpy/scipy (or pure Python if you prefer), what would be a good way to group the contiguous regions of a numpy array and count the length of each region?

Something like this:

import numpy as np

x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
y = contiguousGroup(x)
print(y)

>> [[1,3], [2,2], [3,1], [0,5], [1,1], [2,1], [3,1], [1,2], [0,3]]

I have tried to do this just with loops, but it takes longer than I would like (6 seconds) for an array with about 30 million samples and 20,000 contiguous regions.
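For reference, a plain loop version might look something like this (a sketch, not necessarily the exact code that produced the timings below):

import numpy as np

def contiguousGroup(x):
    # Walk the array once, flushing a [value, length] pair whenever
    # the value changes.
    out = []
    prev = x[0]
    length = 1
    for v in x[1:]:
        if v == prev:
            length += 1
        else:
            out.append([prev, length])
            prev = v
            length = 1
    out.append([prev, length])  # flush the final run
    return out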

Edit:

And now for some speed comparisons (just using time.clock(), averaged over a few hundred iterations, or fewer when a single run takes seconds).
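The timing loop was shaped roughly like this (a sketch; benchmark is an illustrative name, and time.perf_counter() stands in for time.clock(), which was removed in Python 3.8):

import time

def benchmark(func, x, iterations=100):
    # Average over many iterations, or fewer when one call takes seconds.
    start = time.perf_counter()
    for _ in range(iterations):
        y = func(x)
    elapsed = (time.perf_counter() - start) / iterations
    print("Number of elements ", x.size)
    print("Number of regions ", len(y))
    print("Time taken = %f seconds..." % elapsed)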

First, my Python loop code, tested on 5 samples.

Number of elements  33718251
Number of regions  135137
Time taken = 8.644007 seconds...

Number of elements  42503100
Number of regions  6985
Time taken = 10.533305 seconds...

Number of elements  21841302
Number of regions  7619335
Time taken = 7.671015 seconds...

Number of elements  19723928
Number of regions  10799
Time taken = 5.014807 seconds...

Number of elements  16619539
Number of regions  19293
Time taken = 4.207359 seconds...

And now with Divakar's vectorized solution.

Number of elements  33718251
Number of regions  135137
Time taken = 0.063470 seconds...

Number of elements  42503100
Number of regions  6985
Time taken = 0.046293 seconds...

Number of elements  21841302
Number of regions  7619335
Time taken = 1.654288 seconds...

Number of elements  19723928
Number of regions  10799
Time taken = 0.022651 seconds...

Number of elements  16619539
Number of regions  19293
Time taken = 0.021189 seconds...

The modified approach gives roughly the same times (maybe 5% slower at worst).

And now the generator approach from Kasramvd.

Number of elements  33718251
Number of regions  135137
Time taken = 3.834922 seconds...

Number of elements  42503100
Number of regions  6985
Time taken = 4.785480 seconds...

Number of elements  21841302
Number of regions  7619335
Time taken = 6.806867 seconds...

Number of elements  19723928
Number of regions  10799
Time taken = 2.264413 seconds...

Number of elements  16619539
Number of regions  19293
Time taken = 1.778873 seconds...

And now his numpythonic version.

Number of elements  33718251
Number of regions  135137
Time taken = 0.286336 seconds...

Number of elements  42503100
Number of regions  6985
Time taken = 0.174769 seconds...

Memory error on sample 3 (too many regions).

Number of elements  19723928
Number of regions  10799
Time taken = 0.087028 seconds...

Number of elements  16619539
Number of regions  19293
Time taken = 0.084963 seconds...

Anyway, I think the moral of the story is that numpy is very good.

AMagee asked Nov 18 '25 15:11

1 Answer

Here's a vectorized approach - the idea is to locate the indices where consecutive elements differ (the group boundaries), so the differences between successive boundaries are the run lengths -

# Group boundaries: start of array, every value-change position, end of array
idx = np.concatenate(([0],np.flatnonzero(x[:-1]!=x[1:])+1,[x.size]))
# Pair each run's first value with its length
out = list(zip(x[idx[:-1]],np.diff(idx)))

Sample run -

In [34]: x
Out[34]: array([1, 1, 1, 2, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 1, 1, 0, 0, 0])

In [35]: out
Out[35]: [(1, 3), (2, 2), (3, 1), (0, 5), (1, 1), (2, 1), (3, 1), (1, 2), (0, 3)]

Concatenation involving the entire array could be expensive. So, here's a modified version that concatenates only over the group-shift indices instead, like so -

# Positions where the value changes between consecutive elements
idx0 = np.flatnonzero(x[:-1]!=x[1:])
# Lengths: first run, gaps between successive changes, then the final run
count = np.concatenate(([idx0[0]+1],np.diff(idx0),[x.size-idx0[-1]-1]))
out = list(zip(x[np.append(0,idx0+1)],count))
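Note that if x never changes value at all, idx0 comes back empty and idx0[0] raises an IndexError; a minimal guard for that edge case (not part of the original snippet) could look like this -

if idx0.size == 0:
    out = [(x[0], x.size)]  # x is a single constant run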

Alternatively, if a 2D array is okay as the output, we could avoid the zipping at the final step and use NumPy's column_stack, like so -

out = np.column_stack((x[np.append(0,idx0+1)],count))
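For the sample x above, this variant would give a (9, 2) integer array (prompt numbers illustrative) -

In [36]: np.column_stack((x[np.append(0,idx0+1)],count))
Out[36]:
array([[1, 3],
       [2, 2],
       [3, 1],
       [0, 5],
       [1, 1],
       [2, 1],
       [3, 1],
       [1, 2],
       [0, 3]])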
Divakar answered Nov 20 '25 05:11

