Background
I am trying to judge whether a phrase is semantically related to other words found in a corpus using Gensim. For example, here is the corpus document pre-tokenized:
**Corpus**
Car Insurance
Car Insurance Coverage
Auto Insurance
Best Insurance
How much is car insurance
Best auto coverage
Auto policy
Car Policy Insurance
My code (based on this gensim tutorial) judges the semantic relatedness of a phrase using cosine similarity against all strings in the corpus.
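(For reference, by cosine similarity I just mean the dot product of two document vectors divided by the product of their norms; a tiny illustrative sketch with numpy, not taken from my actual code:)

import numpy as np

def cosine(a, b):
    # dot product normalised by the two vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# two toy bag-of-words vectors over the vocabulary [car, insurance, coverage]
print(cosine(np.array([1.0, 1.0, 0.0]), np.array([1.0, 1.0, 1.0])))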
Problem
It seems that if a query contains ANY of the terms found within my dictionary, that phrase is judged as being semantically similar to the corpus (e.g. **Giraffe Poop Car Murderer** has a cosine similarity of 1 but SHOULD be semantically unrelated). I am not sure how to solve this issue.
Code
# Tokenize Corpus and filter out anything that is a stop word or appears only once
texts = [[word for word in document if word not in stoplist]
         for document in documents]

from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word, converts the word
# to its integer word id and returns the result as a sparse vector
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

# convert the query to LSI space
vec_lsi = lsi[vec_bow]

index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
First of all, you are not directly comparing the cosine similarity of bag-of-words vectors; you first reduce the dimensionality of your document vectors by applying latent semantic analysis (https://en.wikipedia.org/wiki/Latent_semantic_analysis). This is fine, but I just wanted to emphasise it. It is often assumed that the underlying semantic space of a corpus has a lower dimensionality than the number of unique tokens. Therefore, LSA applies a truncated singular value decomposition (closely related to principal component analysis) to your vector space and keeps only the directions that contain the most variance (i.e. those directions that are assumed to carry the most information). How many directions are kept is controlled by the num_topics parameter you pass to the LsiModel constructor.
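If you want a feeling for what those retained directions look like, you can inspect the fitted topics of your lsi model; a minimal sketch (the exact token weights will depend on your corpus and gensim version):

# each LSI "topic" is a weighted combination of the original tokens;
# with num_topics=2 there are only two such directions to inspect
for topic_id, topic in lsi.print_topics(num_topics=2, num_words=5):
    print(topic_id, topic)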
Secondly, I cleaned up your code a little bit and embedded the corpus:
# Tokenize Corpus and filter out anything that is a
# stop word or appears only once
from gensim import corpora, models, similarities
from collections import defaultdict

documents = [
    'Car Insurance',              # doc_id 0
    'Car Insurance Coverage',     # doc_id 1
    'Auto Insurance',             # doc_id 2
    'Best Insurance',             # doc_id 3
    'How much is car insurance',  # doc_id 4
    'Best auto coverage',         # doc_id 5
    'Auto policy',                # doc_id 6
    'Car Policy Insurance',       # doc_id 7
]

stoplist = set(['is', 'how'])

texts = [[word.lower() for word in document.split()
          if word.lower() not in stoplist]
         for document in documents]

print(texts)

frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1]
         for text in texts]

dictionary = corpora.Dictionary(texts)

# doc2bow counts the number of occurrences of each distinct word,
# converts the word to its integer word id and returns the result
# as a sparse vector
corpus = [dictionary.doc2bow(text) for text in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

doc = "giraffe poop car murderer"
vec_bow = dictionary.doc2bow(doc.lower().split())

# convert the query to LSI space
vec_lsi = lsi[vec_bow]

index = similarities.MatrixSimilarity(lsi[corpus])

# perform a similarity query against the corpus
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

print(sims)
If I run the above I get the following output:
[(0, 0.97798139), (4, 0.97798139), (7, 0.94720691), (1, 0.89220524), (3, 0.61052465), (2, 0.42138112), (6, -0.1468758), (5, -0.22077486)]
where every entry in that list corresponds to (doc_id, cosine_similarity), ordered by cosine similarity in descending order.
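To make that ranking easier to read, you can map each doc_id back to the original string; for example:

for doc_id, score in sims:
    # documents is the list defined at the top of the script
    print(score, documents[doc_id])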
Since the only word in your query document that is actually part of your vocabulary (constructed from your corpus) is car, all other tokens are dropped. Your query to the model therefore consists of the singleton document car, and consequently all documents that contain car come out as supposedly very similar to your input query.
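You can check this directly: doc2bow silently drops tokens that are not in the dictionary, so the query collapses to a single (word_id, count) pair for car. A quick check, using the variables from the script above:

print(dictionary.token2id)
# 'giraffe', 'poop' and 'murderer' do not appear in the mapping
print(dictionary.doc2bow("giraffe poop car murderer".lower().split()))
# only one (id, 1) pair is left -- the id of 'car'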
The reason why document #3 (Best Insurance) is also ranked highly is that the token insurance frequently co-occurs with car (your query). This is exactly the reasoning behind distributional semantics, i.e. "a word is characterized by the company it keeps" (Firth, J. R. 1957).
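You can see this co-occurrence effect in the reduced space itself: projecting the single-word documents car and insurance into LSI space and comparing them with gensim's matutils.cossim typically shows them pointing in a very similar direction (a rough sketch; the exact value depends on the fitted model):

from gensim import matutils

vec_car = lsi[dictionary.doc2bow(['car'])]
vec_insurance = lsi[dictionary.doc2bow(['insurance'])]
# cosine similarity between the two single-word documents in LSI space
print(matutils.cossim(vec_car, vec_insurance))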