Algorithm for discrete similarity metric

Question

Given that I have two lists that each contain a separate subset of a common superset, is there an algorithm to give me a similarity measurement?

Example:

A = { John, Mary, Kate, Peter } and B = { Peter, James, Mary, Kate }

How similar are these two lists? Note that I do not know all elements of the common superset.

Update: I was unclear and I have probably used the word 'set' in a sloppy fashion. My apologies. Clarification: Order is important. If identical elements occupy the same position in both lists, we have the highest similarity for that element. The similarity decreases the farther apart the identical elements are. The similarity is lower still if the element exists in only one of the lists.

I could even add the extra dimension that lower indices are of greater value, so a[1] == b[1] is worth more than a[9] == b[9], but that is mainly because I am curious.

doug · Accepted Answer

The Jaccard index (also known as the Tanimoto coefficient) is used for precisely the use case described in the OP's question.

The Tanimoto coefficient, tau, is equal to Nc divided by (Na + Nb - Nc), or

tau = Nc / (Na + Nb - Nc)

where:

Na is the number of items in the first set
Nb is the number of items in the second set
Nc is the size of the intersection of the two sets, i.e. the number of unique items common to both a and b

Here's Tanimoto coded as a Python function:

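def tanimoto(x, y):
    x, y = set(x), set(y)  # ignore duplicates
    nc = len(x & y)  # Nc: unique items common to both
    # Convert before dividing so Python 2 does not truncate the result.
    return float(nc) / (len(x) + len(y) - nc)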
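
As a rough illustration, the formula above can be applied to the two lists from the question. The sketch below is self-contained. Since the Jaccard index ignores position and the question's update asks for order to matter, it also includes a hypothetical order-aware variant. The names jaccard and positional_similarity, the 1 / (1 + distance) scoring rule, and the choice to treat two empty inputs as identical are assumptions made for this example, not part of the answer or of any library.

```python
def jaccard(a, b):
    """Jaccard / Tanimoto coefficient of two collections, ignoring order."""
    set_a, set_b = set(a), set(b)
    nc = len(set_a & set_b)                  # Nc: items common to both
    denominator = len(set_a) + len(set_b) - nc
    if denominator == 0:
        return 1.0  # assumption: two empty collections count as identical
    return nc / float(denominator)


def positional_similarity(a, b):
    """Hypothetical order-aware variant (an assumption, not the answer's method).

    Each distinct item scores 1 / (1 + distance between its first positions)
    when it occurs in both lists and 0 when it occurs in only one list.
    The result is the mean score over all distinct items.
    """
    pos_a, pos_b = {}, {}
    for i, item in enumerate(a):
        pos_a.setdefault(item, i)            # keep the first occurrence only
    for i, item in enumerate(b):
        pos_b.setdefault(item, i)
    items = set(pos_a) | set(pos_b)
    if not items:
        return 1.0  # assumption: two empty lists count as identical
    total = 0.0
    for item in items:
        if item in pos_a and item in pos_b:
            total += 1.0 / (1 + abs(pos_a[item] - pos_b[item]))
    return total / len(items)


A = ["John", "Mary", "Kate", "Peter"]
B = ["Peter", "James", "Mary", "Kate"]

print(jaccard(A, B))                # Nc = 3, so 3 / (4 + 4 - 3) = 0.6
print(positional_similarity(A, B))  # (0.5 + 0.5 + 0.25) / 5 = 0.25
```

Here Mary and Kate are each one position apart (score 0.5 each), Peter is three positions apart (score 0.25), and John and James appear in only one list (score 0). Giving lower indices more weight, as the question suggests, would need an extra per-position weight that this sketch does not attempt.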

Tags: algorithm, comparison, machine-learning, similarity

Cubed
