Background
I am trying to judge whether a phrase is semantically related to other words found in a corpus using Gensim. For example, here is the corpus document pre-tokenized:
**Corpus**
Car Insurance
Car Insurance Coverage
Auto Insurance
Best Insurance
How much is car insurance
Best auto coverage
Auto policy
Car Policy Insurance
My code (based on this gensim tutorial) judges the semantic relatedness of a phrase using cosine similarity against all strings in the corpus.
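(For reference, by cosine similarity I just mean the dot product of two document vectors divided by the product of their norms; a tiny illustrative sketch with numpy, not taken from my actual code:)

import numpy as np

def cosine(a, b):
    # dot product normalised by the two vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# two toy bag-of-words vectors over the vocabulary [car, insurance, coverage]
print(cosine(np.array([1.0, 1.0, 0.0]), np.array([1.0, 1.0, 1.0])))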
Problem
It seems that if a query contains ANY of the terms found within my dictionary, that phrase is judged as being semantically similar to the corpus (e.g. **Giraffe Poop Car Murderer** has a cosine similarity of 1 but SHOULD be semantically unrelated). I am not sure how to solve this issue.
Code
# Tokenize Corpus and filter out anything that is a stop word or appears only once
texts = [[word for word in document if word not in stoplist]
         for document in documents]

from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word, converts the word
# to its integer word id and returns the result as a sparse vector
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

# convert the query to LSI space
vec_lsi = lsi[vec_bow]

index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
First of all, you are not directly comparing the cosine similarity of bag-of-words vectors; you first reduce the dimensionality of your document vectors by applying latent semantic analysis (https://en.wikipedia.org/wiki/Latent_semantic_analysis). This is fine, but I just wanted to emphasise it. It is often assumed that the underlying semantic space of a corpus has a lower dimensionality than the number of unique tokens. Therefore, LSA applies a truncated singular value decomposition (closely related to principal component analysis) to your vector space and keeps only the directions that contain the most variance (i.e. those directions that are assumed to carry the most information). How many directions are kept is controlled by the num_topics parameter you pass to the LsiModel constructor.
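If you want a feeling for what those retained directions look like, you can inspect the fitted topics of your lsi model; a minimal sketch (the exact token weights will depend on your corpus and gensim version):

# each LSI "topic" is a weighted combination of the original tokens;
# with num_topics=2 there are only two such directions to inspect
for topic_id, topic in lsi.print_topics(num_topics=2, num_words=5):
    print(topic_id, topic)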
Secondly, I cleaned up your code a little bit and embedded the corpus:
# Tokenize Corpus and filter out anything that is a
# stop word or appears only once
from gensim import corpora, models, similarities
from collections import defaultdict

documents = [
    'Car Insurance',              # doc_id 0
    'Car Insurance Coverage',     # doc_id 1
    'Auto Insurance',             # doc_id 2
    'Best Insurance',             # doc_id 3
    'How much is car insurance',  # doc_id 4
    'Best auto coverage',         # doc_id 5
    'Auto policy',                # doc_id 6
    'Car Policy Insurance',       # doc_id 7
]

stoplist = set(['is', 'how'])

texts = [[word.lower() for word in document.split()
          if word.lower() not in stoplist]
         for document in documents]

print(texts)

frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word,
# converts the word to its integer word id and returns the result
# as a sparse vector
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

# convert the query to LSI space
vec_lsi = lsi[vec_bow]

index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

print(sims)
If I run the above I get the following output:
[(0, 0.97798139), (4, 0.97798139), (7, 0.94720691), (1, 0.89220524), (3, 0.61052465), (2, 0.42138112), (6, -0.1468758), (5, -0.22077486)]
where every entry in that list corresponds to (doc_id, cosine_similarity), ordered by cosine similarity in descending order.
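To make that ranking easier to read, you can map each doc_id back to the original string; for example:

for doc_id, score in sims:
    # documents is the list defined at the top of the script
    print(score, documents[doc_id])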
Since the only word in your query document that is actually part of your vocabulary (constructed from your corpus) is car, all other tokens are dropped. Your query to the model therefore consists of the singleton document car, and consequently all documents that contain car come out as supposedly very similar to your input query.
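You can check this directly: doc2bow silently drops tokens that are not in the dictionary, so the query collapses to a single (word_id, count) pair for car. A quick check, using the variables from the script above:

print(dictionary.token2id)
# 'giraffe', 'poop' and 'murderer' do not appear in the mapping
print(dictionary.doc2bow("giraffe poop car murderer".lower().split()))
# only one (id, 1) pair is left -- the id of 'car'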
The reason why document #3 (Best Insurance) is also ranked highly is that the token insurance frequently co-occurs with car (your query). This is exactly the reasoning behind distributional semantics, i.e. "a word is characterized by the company it keeps" (Firth, J. R. 1957).
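You can see this co-occurrence effect in the reduced space itself: projecting the single-word documents car and insurance into LSI space and comparing them with gensim's matutils.cossim typically shows them pointing in a very similar direction (a rough sketch; the exact value depends on the fitted model):

from gensim import matutils

vec_car = lsi[dictionary.doc2bow(['car'])]
vec_insurance = lsi[dictionary.doc2bow(['insurance'])]
# cosine similarity between the two single-word documents in LSI space
print(matutils.cossim(vec_car, vec_insurance))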