 

Calculate cosine similarity given 2 sentence strings

From Python: tf-idf-cosine: to find document similarity, it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, is there any way to calculate the cosine similarity between 2 strings?

s1 = "This is a foo bar sentence ." s2 = "This sentence is similar to a foo bar sentence ." s3 = "What is this string ? Totally not related to the other two lines ."  cosine_sim(s1, s2) # Should give high cosine similarity cosine_sim(s1, s3) # Shouldn't give high cosine similarity value cosine_sim(s2, s3) # Shouldn't give high cosine similarity value 
asked Mar 02 '13 by alvas


People also ask

How do you find the cosine similarity between two sentences?

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them: Similarity = (A · B) / (||A|| ||B||), where A and B are vectors.
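A minimal sketch of that formula in plain Python, assuming the two vectors are given as equal-length lists of numbers:

import math

def cosine_similarity(a, b):
    # Dot product A · B
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean norms ||A|| and ||B||
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Guard against zero vectors
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 1, 0], [1, 1, 1]))  # ~0.816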

How do you find the similarity between two strings?

Hamming distance, named after the American mathematician Richard Hamming, is the simplest algorithm for measuring string similarity. It compares two strings of equal length and counts the number of positions at which the corresponding characters differ.
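A minimal sketch of that idea in Python, assuming the two strings have equal length:

def hamming_distance(s1, s2):
    # Hamming distance is only defined for strings of equal length
    if len(s1) != len(s2):
        raise ValueError("Strings must be of equal length")
    # Count the positions where the characters differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("karolin", "kathrin"))  # 3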

How do you find the cosine similarity between two documents?

A common way to compute cosine similarity is to first count the word occurrences in each document. To count the word occurrences, we can use the CountVectorizer or TfidfVectorizer classes provided by the Scikit-Learn library.
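For reference, a short sketch of that Scikit-Learn route (note it does rely on an external library, unlike the pure-Python answer below; the sentences are just the ones from the question):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "This is a foo bar sentence .",
    "This sentence is similar to a foo bar sentence .",
]

# Build tf-idf vectors for both documents and compare them
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf)[0, 1])  # similarity between the two documents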


1 Answer

A simple pure-Python implementation would be:

import math
import re
from collections import Counter

# Tokenizer: match runs of word characters
WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    # Only words present in both vectors contribute to the dot product
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    # Turn a text into a bag-of-words count vector
    words = WORD.findall(text)
    return Counter(words)


text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

cosine = get_cosine(vector1, vector2)

print("Cosine:", cosine)

Prints:

Cosine: 0.861640436855 
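If you want the exact cosine_sim(s1, s2) interface from the question, a thin wrapper over the two functions above would look like this (cosine_sim is just the name the question asks for, with s1, s2 and s3 as defined there):

def cosine_sim(text_a, text_b):
    # Convenience wrapper: vectorize both texts, then compare them
    return get_cosine(text_to_vector(text_a), text_to_vector(text_b))

print(cosine_sim(s1, s2))  # high value (~0.86, as above)
print(cosine_sim(s1, s3))  # much lower value
print(cosine_sim(s2, s3))  # much lower value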

The cosine formula used here is described here.

This does not include weighting of the words by tf-idf; to use tf-idf, you need a reasonably large corpus from which to estimate the tf-idf weights.
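As a rough sketch of what that weighting could look like (the corpus here is a hypothetical list of tokenized documents, and idf_weights / tfidf_vector are illustrative helpers, not part of the answer above):

import math
from collections import Counter

def idf_weights(corpus):
    # corpus: list of token lists; idf(t) = log(N / df(t))
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    return {term: math.log(n_docs / df[term]) for term in df}

def tfidf_vector(tokens, idf):
    # Weight raw term counts by idf; terms unseen in the corpus get weight 0
    tf = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}

The resulting dicts can be passed to get_cosine() in place of the raw Counter vectors.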

You can also develop it further by using a more sophisticated way to extract words from a piece of text, stemming or lemmatising them, etc.
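For example, one possible refinement using NLTK's Porter stemmer (an external library; the tokenizer and stemmer choice are only illustrative, and WORD and Counter are reused from the answer above):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def text_to_vector(text):
    # Lowercase, tokenize with the same regex, then stem each word
    words = WORD.findall(text.lower())
    return Counter(stemmer.stem(w) for w in words)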

answered Nov 11 '22 by vpekar