python: How to calculate the cosine similarity of two word lists?

I want to calculate the cosine similarity of two lists like the following:

A = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']

B = [u'home (private)', u'school', u'bank', u'shopping mall']

I know the cosine similarity of A and B should be

3/(sqrt(7)*sqrt(4)).
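As a worked check of that value: treat each distinct list element as one dimension and use its count as the component. The dimension order below (home (private), bank, building(condo/apartment), factory, school, shopping mall) is an arbitrary choice made for illustration; any consistent order gives the same result.

```latex
\begin{aligned}
\mathbf{a} &= (1, 2, 1, 1, 0, 0), \qquad \mathbf{b} = (1, 1, 0, 0, 1, 1) \\
\mathbf{a} \cdot \mathbf{b} &= 1 \cdot 1 + 2 \cdot 1 = 3 \\
\lVert \mathbf{a} \rVert &= \sqrt{1^2 + 2^2 + 1^2 + 1^2} = \sqrt{7}, \qquad
\lVert \mathbf{b} \rVert = \sqrt{1^2 + 1^2 + 1^2 + 1^2} = \sqrt{4} \\
\cos\theta &= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
            = \frac{3}{\sqrt{7}\,\sqrt{4}} \approx 0.567
\end{aligned}
```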

I tried to reshape the lists into a form like 'home bank bank building factory', which looks like a sentence. However, some elements (e.g. home (private)) contain spaces and some contain brackets, so I find it difficult to count the word occurrences.

Do you know how to count the word occurrences in a list like this, so that for list B the occurrences can be represented as

{'home (private)': 1, 'school': 1, 'bank': 1, 'shopping mall': 1}?
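One way to get such a mapping without joining the list into a sentence is to count whole list elements. This is a minimal sketch, assuming the standard-library `collections.Counter` is acceptable; the variable name `b_counts` is only illustrative.

```python
from collections import Counter

B = ['home (private)', 'school', 'bank', 'shopping mall']

# Counter uses each list element as one hashable key, so spaces and
# brackets inside an element need no special handling.
b_counts = Counter(B)

print(dict(b_counts))
```

Because every element is counted as a single key, 'home (private)' is never split into separate words.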

Or do you know how to calculate the cosine similarity of these two lists?

Thank you very much

gladys0313 asked Mar 02 '15 20:03



1 Answer

from collections import Counter

# word-lists to compare
a = [u'home (private)', u'bank', u'bank', u'building(condo/apartment)','factory']
b = [u'home (private)', u'school', u'bank', u'shopping mall']

# count word occurrences
a_vals = Counter(a)
b_vals = Counter(b)

# convert to word-vectors
words  = sorted(a_vals.keys() | b_vals.keys())          # sorted so the vector order is reproducible
a_vect = [a_vals.get(word, 0) for word in words]        # [2, 1, 1, 1, 0, 0]
b_vect = [b_vals.get(word, 0) for word in words]        # [1, 0, 0, 1, 1, 1]

# find cosine
len_a  = sum(av*av for av in a_vect) ** 0.5             # sqrt(7)
len_b  = sum(bv*bv for bv in b_vect) ** 0.5             # sqrt(4)
dot    = sum(av*bv for av,bv in zip(a_vect, b_vect))    # 3
cosine = dot / (len_a * len_b)                          # 0.5669467
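
The same steps can be wrapped in a reusable function. The following is a sketch rather than part of the original answer: the name `cosine_similarity` is hypothetical, and returning 0.0 for an empty list is an assumption made here to avoid division by zero. It skips building explicit vectors and sums only over the keys the two lists share.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a_items, b_items):
    """Cosine similarity of two lists, using element counts as vector components."""
    a_counts, b_counts = Counter(a_items), Counter(b_items)

    # Only elements present in both lists contribute to the dot product.
    shared = a_counts.keys() & b_counts.keys()
    dot = sum(a_counts[key] * b_counts[key] for key in shared)

    norm_a = sqrt(sum(count * count for count in a_counts.values()))
    norm_b = sqrt(sum(count * count for count in b_counts.values()))

    # Assumption: an empty list has no direction, so report 0.0 instead of failing.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


a = ['home (private)', 'bank', 'bank', 'building(condo/apartment)', 'factory']
b = ['home (private)', 'school', 'bank', 'shopping mall']

print(cosine_similarity(a, b))
```

For the two lists from the question this evaluates 3 / (sqrt(7) * sqrt(4)), which is roughly 0.567.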
Hugh Bothwell answered Oct 24 '22 23:10