Consider the sorted array a
:
a = np.array([0, 2, 3, 4, 5, 10, 11, 11, 14, 19, 20, 20])
If I specified left and right deltas,
delta_left, delta_right = 1, 1
Then this is how I'd expect the clusters to be assigned:
# a = [ 0 . 2 3 4 5 . . . . 10 11 . . 14 . . . . 19 20
# 11 20
#
# [10--|-12] [19--|-21]
# [1--|--3] [10--|-12] [19--|-21]
# [-1--|--1] [3--|--5] [9--|-11] [18--|-20]
# +--+--|--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
# [2--|--4] [13--|-15]
#
# │ ╰──┬───╯ ╰┬─╯ │ ╰┬─╯
# │ cluster 2 Cluster 3 │ Cluster 5
# Cluster 1 Cluster 4
NOTE: Despite the interval [-1, 1]
sharing an edge with [1, 3]
, neither interval includes an adjacent point and therefore do not constitute joining their respective clusters.
Assuming the cluster assignments were stored in an array named clusters
, I'd expect the results to look like this
print(clusters)
[1 2 2 2 2 3 3 3 4 5 5 5]
However, suppose I change the left and right deltas to be different:
delta_left, delta_right = 2, 1
This means that for a value of x
it should be combined with any other point in the interval [x - 2, x + 1]
# a = [ 0 . 2 3 4 5 . . . . 10 11 . . 14 . . . . 19 20
# 11 20
#
# [9-----|-12] [18-----|-21]
# [0-----|--3] [9-----|-12] [18-----|-21]
# [-2-----|--1][2-----|--5] [8-----|-11] [17-----|-20]
# +--+--|--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--|
# [1 ----|--4] [12-----|-15]
#
# ╰─────┬─────╯ ╰┬─╯ │ ╰┬─╯
# cluster 1 Cluster 2 │ Cluster 4
# Cluster 3
NOTE: Despite the interval [9, 12]
sharing an edge with [12, 15]
, neither interval includes an adjacent point and therefore do not constitute joining their respective clusters.
Assuming the cluster assignments were stored in an array named clusters
, I'd expect the results to look like this:
print(clusters)
[1 1 1 1 1 2 2 2 3 4 4 4]
We will leverage np.searchsorted
and logic to find cluster edges.
First, let's take a closer look at what np.searchsorted
does:
Find the indices into a sorted array a such that, if the corresponding elements in v were inserted before the indices, the order of a would be preserved.
What I'll do is execute np.searchsorted
with a
using a - delta_left
. Let's look at that for delta_left = 1
# a =
# [ 0 2 3 4 5 10 11 11 14 19 20 20]
#
# a - delta_left
# [-1 1 2 3 4 9 10 10 13 18 19 19]
-1
would get inserted at position 0
to maintain order1
would get inserted at position 1
to maintain order2
would get inserted at position 1
as well, indicating that 2
might be in the same cluster as 1
3
would get inserted at position 2
indicating that 3
might be in the same cluster as 2
What we notice is that only when an element less delta would get inserted at its current position would we consider a new cluster starting.
We do this again for the right side with a difference. The difference is that by default if a bunch of elements are the same, np.searchsorted
assumes to insert into the front of values. To identify the ends of clusters, I'm going to want to insert after the identical elements. Therefore I'll use the paramater side='right'
If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index. If there is no suitable index, return either 0 or N (where N is the length of a).
Now the logic. A cluster can only begin if a prior cluster has ended, with the exception of the first cluster. We'll then consider a shifted version of the results of our second np.searchsorted
Let's now define our function
def delta_cluster(a, dleft, dright):
# use to track whether searchsorted results are at correct positions
rng = np.arange(len(a))
edge_left = a.searchsorted(a - dleft)
starts = edge_left == rng
# we append 0 to shift
edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
ends = edge_right == rng
return (starts & ends).cumsum()
demonstration
with left, right deltas equal to 1 and 1
print(delta_cluster(a, 1, 1))
[1 2 2 2 2 3 3 3 4 5 5 5]
with left, right deltas equal to 2 and 1
print(delta_cluster(a, 2, 1))
[1 1 1 1 1 2 2 2 3 4 4 4]
Extra Credit
What if a
isn't sorted?
I'll utilize information learned from this post
def delta_cluster(a, dleft, dright):
s = a.argsort()
size = s.size
if size > 1000:
y = np.empty(s.size, dtype=np.int64)
y[s] = np.arange(s.size)
else:
y = s.argsort()
a = a[s]
rng = np.arange(len(a))
edge_left = a.searchsorted(a - dleft)
starts = edge_left == rng
edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
ends = edge_right == rng
return (starts & ends).cumsum()[y]
demonstration
b = np.random.permutation(a)
print(b)
[14 10 3 11 20 0 19 20 4 11 5 2]
print(delta_cluster(a, 2, 1))
[1 1 1 1 1 2 2 2 3 4 4 4]
print(delta_cluster(b, 2, 1))
[3 2 1 2 4 1 4 4 1 2 1 1]
print(delta_cluster(b, 2, 1)[b.argsort()])
[1 1 1 1 1 2 2 2 3 4 4 4]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With