Semantic Similarity between Phrases Using GenSim

Background

I am trying to judge whether a phrase is semantically related to other words found in a corpus using Gensim. For example, here is the corpus document pre-tokenized:

 **Corpus**
 Car Insurance
 Car Insurance Coverage
 Auto Insurance
 Best Insurance
 How much is car insurance
 Best auto coverage
 Auto policy
 Car Policy Insurance

My code (based on this gensim tutorial) judges the semantic relatedness of a phrase using cosine similarity against all strings in the corpus.

Problem

It seems that if a query contains ANY of the terms found within my dictionary, that phrase is judged as being semantically similar to the corpus (e.g. **Giraffe Poop Car Murderer** has a cosine similarity of 1 but SHOULD be semantically unrelated). I am not sure how to solve this issue.

Code

# Tokenize Corpus and filter out anything that is a stop word or has a frequency of 1
texts = [[word for word in document if word not in stoplist]
        for document in documents]
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
        for text in texts]
dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word, converts the word
# to its integer word id and returns the result as a sparse vector

corpus = [dictionary.doc2bow(text) for text in texts]  
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

#convert the query to LSI space
vec_lsi = lsi[vec_bow]              
index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
asked Aug 05 '15 at 01:08 by user3682157


1 Answer

First of all, you are not directly comparing the cosine similarity of bag-of-words vectors, but first reducing the dimensionality of your document vectors by applying latent semantic analysis (https://en.wikipedia.org/wiki/Latent_semantic_analysis). This is fine, but I just wanted to emphasise that. It is often assumed that the underlying semantic space of a corpus is of a lower dimensionality than the number of unique tokens. Therefore, LSA applies principal component analysis to your vector space and only keeps the directions that contain the most variance (i.e. those directions in the space that change most rapidly, and thus are assumed to contain more information). This is influenced by the num_topics parameter you pass to the LsiModel constructor.
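As a quick, self-contained illustration of that point (the toy_docs corpus and variable names below are made up for this sketch, not taken from your code), an LsiModel maps each sparse bag-of-words vector to a dense vector with at most num_topics components:

from gensim import corpora, models

toy_docs = [['car', 'insurance'],
            ['auto', 'insurance'],
            ['auto', 'policy']]
toy_dict = corpora.Dictionary(toy_docs)
toy_bow = [toy_dict.doc2bow(d) for d in toy_docs]

# num_topics=2 keeps only the two strongest directions of variance
toy_lsi = models.LsiModel(toy_bow, id2word=toy_dict, num_topics=2)

# prints a dense vector with at most 2 components, e.g. [(0, ...), (1, ...)];
# the exact weights depend on the corpus
print(toy_lsi[toy_dict.doc2bow(['car', 'insurance'])])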

Secondly, I cleaned up your code a little bit and embedded the corpus:

# Tokenize Corpus and filter out anything that is a
# stop word or has a frequency of 1

from gensim import corpora, models, similarities
from collections import defaultdict

documents = [
    'Car Insurance',  # doc_id 0
    'Car Insurance Coverage',  # doc_id 1
    'Auto Insurance',  # doc_id 2
    'Best Insurance',  # doc_id 3
    'How much is car insurance',  # doc_id 4
    'Best auto coverage',  # doc_id 5
    'Auto policy',  # doc_id 6
    'Car Policy Insurance',  # doc_id 7
]

stoplist = set(['is', 'how'])

texts = [[word.lower() for word in document.split()
          if word.lower() not in stoplist]
         for document in documents]

print(texts)
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word,
# converts the word to its integer word id and returns the result
# as a sparse vector

corpus = [dictionary.doc2bow(text) for text in texts]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

# convert the query to LSI space
vec_lsi = lsi[vec_bow]
index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

print(sims)

If I run the above I get the following output:

[(0, 0.97798139), (4, 0.97798139), (7, 0.94720691), (1, 0.89220524), (3, 0.61052465), (2, 0.42138112), (6, -0.1468758), (5, -0.22077486)]

where every entry in that list corresponds to (doc_id, cosine_similarity) ordered by cosine similarity in descending order.
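To make that output easier to read, you can map each doc_id back to its original string (this just reuses the documents and sims variables from the code above):

for doc_id, score in sims:
    print('%.3f  %s' % (score, documents[doc_id]))

which, for the run above, starts with Car Insurance and How much is car insurance at roughly 0.978.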

Since the only word in your query document that is actually part of your vocabulary (constructed from your corpus) is car, all other tokens will be dropped. Therefore, your query to your model consists of the singleton document car. Consequently, you can see that all documents which contain car are supposedly very similar to your input query.
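You can check this directly: doc2bow silently ignores tokens that are not in the dictionary (reusing the dictionary variable from the code above):

print(dictionary.token2id)   # the vocabulary built from the corpus
print(dictionary.doc2bow("giraffe poop car murderer".lower().split()))
# only the (id, count) entry for "car" comes back; "giraffe", "poop"
# and "murderer" are out-of-vocabulary and are dropped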

The reason why document #3 (Best Insurance) is ranked highly as well is that the token insurance often co-occurs with car (your query). This is exactly the reasoning behind distributional semantics, i.e. "a word is characterized by the company it keeps" (Firth, J. R. 1957).
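One way to see this co-occurrence effect is to inspect the latent dimensions of the lsi model trained above; the exact weights and signs vary between runs and corpora, but terms that frequently co-occur (such as car and insurance) tend to load heavily on the same topic:

# print the top terms and their weights for each latent dimension
for topic_id in range(lsi.num_topics):
    print(topic_id, lsi.show_topic(topic_id, topn=5))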

answered Oct 14 '22 at 02:10 by cvangysel