From "Python: tf-idf-cosine: to find document similarity", it is possible to calculate document similarity using tf-idf cosine. Without importing external libraries, is there any way to calculate cosine similarity between two strings?
s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."

cosine_sim(s1, s2)  # Should give high cosine similarity
cosine_sim(s1, s3)  # Shouldn't give high cosine similarity value
cosine_sim(s2, s3)  # Shouldn't give high cosine similarity value
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: it is the cosine of the angle between them.

similarity = (A · B) / (||A|| ||B||)

where A and B are vectors, A · B is their dot product, and ||A|| and ||B|| are their Euclidean norms.
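As a quick check of the formula on plain numeric vectors, here is a minimal sketch. The function name `cosine_similarity` is only illustrative, and returning 0.0 for a zero vector is a convention assumed here, since the measure is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumed convention: the measure is undefined for zero vectors.
        return 0.0
    return dot / (norm_a * norm_b)


# Parallel vectors point the same way, so the result is (numerically) close to 1.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
# Orthogonal vectors have a zero dot product, so the result is 0.0.
print(cosine_similarity([1, 0], [0, 1]))
```

The result depends only on the direction of the vectors, not their length, which is why a document and a doubled copy of it score as identical.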
Hamming distance, named after the American mathematician Richard Hamming, is one of the simplest measures of string similarity. It counts the number of positions at which two strings of equal length differ, so a smaller distance means more similar strings.
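For comparison, Hamming distance takes only a few lines of standard-library Python. This is a sketch; the function name `hamming_distance` and the choice to raise `ValueError` on unequal lengths are assumptions made here for illustration.

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is defined only for strings of equal length")
    # Each mismatch contributes True (counted as 1) to the sum.
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


print(hamming_distance("karolin", "kathrin"))  # 3
```

Because it compares character positions and requires equal lengths, it is a poor fit for sentences like the ones in the question, which is why a word-count vector plus cosine similarity is used instead.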
The common way to compute cosine similarity between documents is to first count the word occurrences in each document. To do that, you can use the CountVectorizer or TfidfVectorizer classes provided by the scikit-learn library.
A simple pure-Python implementation would be:
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def get_cosine(vec1, vec2):
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator


def text_to_vector(text):
    words = WORD.findall(text)
    return Counter(words)


text1 = "This is a foo bar sentence ."
text2 = "This sentence is similar to a foo bar sentence ."

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

cosine = get_cosine(vector1, vector2)

print("Cosine:", cosine)
Prints:
Cosine: 0.861640436855
The cosine formula used here is the standard one: the dot product of the two word-count vectors divided by the product of their Euclidean norms.
This does not weight the words by tf-idf. To use tf-idf, you need a reasonably large corpus from which to estimate the tf-idf weights.
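To show only the mechanics of tf-idf weighting without external libraries, here is a rough sketch that treats the three sentences from the question as the whole corpus. That corpus is far too small to give meaningful weights. The helper names `tfidf_vectors` and `cosine`, the lowercasing step, and the smoothed idf variant `log((1 + N) / (1 + df)) + 1` are all assumptions made for this example, and other idf definitions are in common use.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def tfidf_vectors(documents):
    """Return one {word: weight} dict per document, using raw tf times a smoothed idf."""
    counts = [Counter(WORD.findall(doc.lower())) for doc in documents]
    n_docs = len(counts)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for c in counts for word in c)
    # Smoothed idf (an assumed variant; several definitions exist).
    idf = {word: math.log((1 + n_docs) / (1 + freq)) + 1 for word, freq in df.items()}
    return [{word: tf * idf[word] for word, tf in c.items()} for c in counts]


def cosine(vec1, vec2):
    """Cosine similarity between two sparse {word: weight} vectors."""
    shared = set(vec1) & set(vec2)
    numerator = sum(vec1[w] * vec2[w] for w in shared)
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."

v1, v2, v3 = tfidf_vectors([s1, s2, s3])
print("s1 vs s2:", cosine(v1, v2))
print("s1 vs s3:", cosine(v1, v3))
```

Words that occur in every document, such as "this" and "is", get the lowest idf, so they contribute less to the score than rarer words like "foo" and "bar".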
You can also develop it further by using a more sophisticated way to extract words from a piece of text, stemming or lemmatising them, etc.
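One small step in that direction, sketched here under the assumption that case should be ignored, is to lowercase the text before counting words. The wrapper below also matches the `cosine_sim(s1, s2)` call shape from the question. It is a sketch, not a replacement for proper stemming or lemmatisation, which generally needs an external library.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")


def cosine_sim(text1, text2):
    """Cosine similarity between the lowercased word-count vectors of two strings."""
    vec1 = Counter(WORD.findall(text1.lower()))
    vec2 = Counter(WORD.findall(text2.lower()))
    numerator = sum(vec1[w] * vec2[w] for w in set(vec1) & set(vec2))
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    denominator = norm1 * norm2
    return numerator / denominator if denominator else 0.0


s1 = "This is a foo bar sentence ."
s2 = "This sentence is similar to a foo bar sentence ."
s3 = "What is this string ? Totally not related to the other two lines ."

print(cosine_sim(s1, s2))  # about 0.86
print(cosine_sim(s1, s3))  # about 0.24
print(cosine_sim(s2, s3))  # about 0.26
```

Lowercasing matters for s3: without it, "this" in s3 would not match "This" in s1 and s2, and the last two scores would come out lower.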