
How can I calculate the Jaccard Similarity of two lists containing strings in Python?

I have two lists with usernames and I want to calculate the Jaccard similarity. Is it possible?

This thread shows how to calculate the Jaccard similarity between two strings; however, I want to apply it to two lists, where each element is one word (e.g., a username).

Aventinus asked Oct 27 '17 13:10


3 Answers

I ended up writing my own solution after all:

def jaccard_similarity(list1, list2):
    # Duplicates are ignored: both lists are compared as sets.
    intersection = len(set(list1).intersection(list2))
    # len(A | B) == len(A) + len(B) - len(A & B)
    union = len(set(list1)) + len(set(list2)) - intersection
    return float(intersection) / union
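
One caveat with the function above: when both lists are empty, the union is 0 and the division raises ZeroDivisionError. Here is a minimal sketch of a guarded variant. The name jaccard_similarity_safe and the choice to return 1.0 for two empty inputs are assumptions made for illustration, not part of the original answer.

```python
def jaccard_similarity_safe(list1, list2):
    """Jaccard similarity of two lists, compared as sets of unique items."""
    s1, s2 = set(list1), set(list2)
    union = s1 | s2
    if not union:
        # Assumption: two empty inputs are treated as identical.
        return 1.0
    # float() keeps the result correct under Python 2 as well.
    return len(s1 & s2) / float(len(union))
```

Returning 0.0 or raising an error for two empty lists would be equally defensible; which one fits depends on how the score is used downstream.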
Aventinus answered Sep 27 '22 06:09


For Python 3:

def jaccard_similarity(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    # In Python 3, / is true division, so the result is already a float.
    return len(s1.intersection(s2)) / len(s1.union(s2))

list1 = ['dog', 'cat', 'cat', 'rat']
list2 = ['dog', 'cat', 'mouse']
print(jaccard_similarity(list1, list2))  # 0.5

For Python 2, where dividing two integers truncates the result, change the last line of the function to:

    return len(s1.intersection(s2)) / float(len(s1.union(s2)))

w4bo answered Sep 28 '22 06:09


@aventinus I don't have enough reputation to comment on your answer, but to make things clearer: your solution computes the Jaccard similarity, yet the function is misnamed jaccard_distance. The Jaccard distance is actually 1 - jaccard_similarity.
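
To make the distinction concrete, here is a minimal sketch of both quantities side by side. It assumes at least one of the lists is non-empty (otherwise the union is empty and the division fails). The helper names simply follow the ones used in this thread.

```python
def jaccard_similarity(list1, list2):
    """Size of the intersection divided by size of the union."""
    s1, s2 = set(list1), set(list2)
    return len(s1 & s2) / float(len(s1 | s2))


def jaccard_distance(list1, list2):
    """Distance is the complement of similarity: 1 - similarity."""
    return 1.0 - jaccard_similarity(list1, list2)
```

Identical sets give a similarity of 1.0 and a distance of 0.0; disjoint sets give a similarity of 0.0 and a distance of 1.0.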

iamlcc answered Sep 26 '22 06:09