In NumPy/SciPy (or pure Python if you prefer), what would be a good way to group contiguous regions in a NumPy array and count the lengths of those regions?
Something like this:
x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
y = contiguousGroup(x)
print y
>> [[1,3], [2,2], [3,1], [0,5], [1,1], [2,1], [3,1], [1,2], [0,3]]
I have tried doing this with plain loops, but it takes longer than I would like (about 6 seconds) on a list with roughly 30 million samples and 20,000 contiguous regions.
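For reference, the kind of loop I mean is roughly the sketch below (an illustrative reconstruction, not necessarily the exact code used for the timings that follow) -
def contiguousGroup(x):
    # Sketch of a plain single-pass loop: record [value, run length]
    # every time the value changes. (Illustrative; not the exact code timed below.)
    out = []
    prev = x[0]
    length = 1
    for v in x[1:]:
        if v == prev:
            length += 1
        else:
            out.append([int(prev), length])
            prev = v
            length = 1
    out.append([int(prev), length])
    return out
Iterating over a NumPy array element by element like this is what makes the pure-loop version slow once there are tens of millions of samples.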
Edit:
And now for some speed comparisons (timed with time.clock() over a few hundred iterations, or fewer when a single run takes seconds).
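Something like the sketch below (not the exact harness; timeRegionGrouping is just an illustrative name, and time.perf_counter() stands in for time.clock() so the sketch runs on current Python) -
import time

def timeRegionGrouping(group_func, x, iterations=100):
    # Rough sketch of the timing loop: average the per-run time over
    # several iterations and report it alongside the region count.
    start = time.perf_counter()
    for _ in range(iterations):
        regions = list(group_func(x))
    elapsed = (time.perf_counter() - start) / iterations
    print("Number of elements %d" % x.size)
    print("Number of regions %d" % len(regions))
    print("Time taken = %f seconds..." % elapsed)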
First, my Python loop code, tested on 5 samples.
Number of elements 33718251
Number of regions 135137
Time taken = 8.644007 seconds...
Number of elements 42503100
Number of regions 6985
Time taken = 10.533305 seconds...
Number of elements 21841302
Number of regions 7619335
Time taken = 7.671015 seconds...
Number of elements 19723928
Number of regions 10799
Time taken = 5.014807 seconds...
Number of elements 16619539
Number of regions 19293
Time taken = 4.207359 seconds...
And now with Divakar's vectorized solution.
Number of elements 33718251
Number of regions 135137
Time taken = 0.063470 seconds...
Number of elements 42503100
Number of regions 6985
Time taken = 0.046293 seconds...
Number of elements 21841302
Number of regions 7619335
Time taken = 1.654288 seconds...
Number of elements 19723928
Number of regions 10799
Time taken = 0.022651 seconds...
Number of elements 16619539
Number of regions 19293
Time taken = 0.021189 seconds...
The modified approach gives roughly the same times (maybe 5% slower at worst).
And now with the generator approach from Kasramvd.
Number of elements 33718251
Number of regions 135137
Time taken = 3.834922 seconds...
Number of elements 42503100
Number of regions 6985
Time taken = 4.785480 seconds...
Number of elements 21841302
Number of regions 7619335
Time taken = 6.806867 seconds...
Number of elements 19723928
Number of regions 10799
Time taken = 2.264413 seconds...
Number of elements 16619539
Number of regions 19293
Time taken = 1.778873 seconds...
And now his numpythonic version.
Number of elements 33718251
Number of regions 135137
Time taken = 0.286336 seconds...
Number of elements 42503100
Number of regions 6985
Time taken = 0.174769 seconds...
Memory error on sample 3 (too many regions).
Number of elements 19723928
Number of regions 10799
Time taken = 0.087028 seconds...
Number of elements 16619539
Number of regions 19293
Time taken = 0.084963 seconds...
Anyway, I think the moral of the story is that NumPy is very good.
Here's a vectorized approach -
# Indices where a new region starts, padded with 0 and x.size
idx = np.concatenate(([0],np.flatnonzero(x[:-1]!=x[1:])+1,[x.size]))
# Pair each region's value with its run length
out = zip(x[idx[:-1]],np.diff(idx))
Sample run -
In [34]: x
Out[34]: array([1, 1, 1, 2, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 1, 1, 0, 0, 0])
In [35]: out
Out[35]: [(1, 3), (2, 2), (3, 1), (0, 5), (1, 1), (2, 1), (3, 1), (1, 2), (0, 3)]
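To make the indexing more concrete, these are the intermediate arrays for that same sample (derived from the two lines above) -
idx = np.concatenate(([0], np.flatnonzero(x[:-1] != x[1:]) + 1, [x.size]))
print(idx)           # [ 0  3  5  6 11 12 13 14 16 19] -> start of each region, plus x.size
print(x[idx[:-1]])   # [1 2 3 0 1 2 3 1 0]             -> value of each region
print(np.diff(idx))  # [3 2 1 5 1 1 1 2 3]             -> length of each region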
The concatenation on the entire array could be expensive. So, here's a modified version that instead performs the concatenation only on the group-shifting indices, like so -
# Indices at which the value changes to a new region
idx0 = np.flatnonzero(x[:-1]!=x[1:])
# Region lengths: first region, gaps between change points, last region
count = np.concatenate(([idx0[0]+1],np.diff(idx0),[x.size-idx0[-1]-1]))
# Pair each region's value with its length
out = zip(x[np.append(0,idx0+1)],count)
Alternatively, at the final step, if output as a 2D array is okay, we can avoid the zipping and use NumPy's column_stack, like so -
out = np.column_stack((x[np.append(0,idx0+1)],count))
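As a quick end-to-end check, the modified version together with column_stack reproduces the desired grouping on the question's sample array (shown here as a plain script; note that it assumes at least one value change, i.e. idx0 is non-empty) -
import numpy as np

x = np.array([1, 1, 1, 2, 2, 3, 0, 0, 0, 0, 0, 1, 2, 3, 1, 1, 0, 0, 0])

# Indices at which the value changes from one element to the next
idx0 = np.flatnonzero(x[:-1] != x[1:])
# Region lengths: first region, gaps between change points, final region
count = np.concatenate(([idx0[0] + 1], np.diff(idx0), [x.size - idx0[-1] - 1]))
# One row per region: [value, length]
out = np.column_stack((x[np.append(0, idx0 + 1)], count))
print(out)
# [[1 3]
#  [2 2]
#  [3 1]
#  [0 5]
#  [1 1]
#  [2 1]
#  [3 1]
#  [1 2]
#  [0 3]]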