I have n
data points in some arbitrary space and I cluster them.
The result of my clustering algorithm is a partition represented by an int vector l
of length n
assigning each point to a cluster. Values of l
ranges from 0 to (possibly) n-1
.
Example:
l_1 = [ 1 1 1 0 0 2 6 ]
Is a partition of n=7
points into 4 clusters: first three points are clustered together, the fourth and fifth are together and the last two points forms two distinct singleton clusters.
Suppose I have two partitions l_1
and l_2
how can I efficiently determine if they represents identical partitions?
Example:
l_2 = [ 2 2 2 9 9 3 1 ]
is identical to l_1
since it represents the same partitions of the points (despite the fact that the "numbers"/"labels" of the clusters are not identical).
On the other hand
l_3 = [ 2 2 2 9 9 3 3 ]
is no longer identical since it groups together the last two points.
I'm looking for a solution in either C++, python or Matlab.
A naive approach would be to compare the co-occurrence matrix
c1 = bsxfun( @eq, l_1, l_1' );
c2 = bsxfun( @eq, l_2, l_2' );
l_1_l_2_are_identical = all( c1(:)==c2(:) );
The co-occurrence matrix c1
is of size n
xn
with true
if points k
and m
are in the same cluster and false
otherwise (regardless of the cluster "number"/"label").
Therefore if the co-occurrence matrices c1
and c2
are identical then l_1
and l_2
represent identical partitions.
However, since the number of points, n
, might be quite large I would like to avoid O(n^2
) solutions...
Any ideas?
Thanks!
When are two partition identical?
Probably if they have the exact same members.
So if you just want to test for identity, you can do the following:
Substitute each partition ID with the smallest object ID in the partition.
Then two partitionings are identical if and only if this representation is identical.
In your example above, lets assume the vector index 1 .. 7 is your object ID. Then I would get the canonical form
[ 1 1 1 4 4 6 7 ]
^ first occurrence at pos 1 of 1 in l_1 / 2 in l_2
^ first occurrence at pos 4
for l_1 and l_2, whereas l_3 canonicalizes to
[ 1 1 1 4 4 6 6 ]
To make it more clear, here is another example:
l_4 = [ A B 0 D 0 B A ]
canonicalizes to
[ 1 2 3 4 3 2 1 ]
since the first occurence of cluster "A" is at position 1, "B" at position 2 etc.
If you want to measure how similar two clusterings are, a good approach is to look at precision/recall/f1 of the object pairs, where the pair (a,b) exists if and only if a and b belong to the same cluster.
Update: Since it was claimed that this is quadratic, I will further clarify.
To produce the canonical form, use the following approach (actual python code):
def canonical_form(li):
""" Note, this implementation overwrites li """
first = dict()
for i in range(len(li)):
v = first.get(li[i])
if v is None:
first[li[i]] = i
v = i
li[i] = v
return li
print canonical_form([ 1, 1, 1, 0, 0, 2, 6 ])
# [0, 0, 0, 3, 3, 5, 6]
print canonical_form([ 2, 2, 2, 9, 9, 3, 1 ])
# [0, 0, 0, 3, 3, 5, 6]
print canonical_form([ 2, 2, 2, 9, 9, 3, 3 ])
# [0, 0, 0, 3, 3, 5, 5]
print canonical_form(['A','B',0,'D',0,'B','A'])
# [0, 1, 2, 3, 2, 1, 0]
print canonical_form([1,1,1,0,0,2,6]) == canonical_form([2,2,2,9,9,3,1])
# True
print canonical_form([1,1,1,0,0,2,6]) == canonical_form([2,2,2,9,9,3,3])
# False
If you are going to relabel your partitions, as has been previously suggested, you will potentially need to search through n labels for each of the n items. I.e. the solutions are O(n^2).
Here is my idea: Scan through both lists simultaneously, maintaining a counter for each partition label in each list. You will need to be able to map partition labels to counter numbers. If the counters for each list do not match, then the partitions do not match. This would be O(n).
Here is a proof of concept in Python:
l_1 = [ 1, 1, 1, 0, 0, 2, 6 ]
l_2 = [ 2, 2, 2, 9, 9, 3, 1 ]
l_3 = [ 2, 2, 2, 9, 9, 3, 3 ]
d1 = dict()
d2 = dict()
c1 = []
c2 = []
# assume lists same length
match = True
for i in range(len(l_1)):
if l_1[i] not in d1:
x1 = len(c1)
d1[l_1[i]] = x1
c1.append(1)
else:
x1 = d1[l_1[i]]
c1[x1] += 1
if l_2[i] not in d2:
x2 = len(c2)
d2[l_2[i]] = x2
c2.append(1)
else:
x2 = d2[l_2[i]]
c2[x2] += 1
if x1 != x2 or c1[x1] != c2[x2]:
match = False
print "match = {}".format(match)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With