 

Python Gensim: how to calculate document similarity using the LDA model?

I've got a trained LDA model and I want to calculate the similarity score between two documents from the corpus I trained my model on. After studying all the Gensim tutorials and functions, I still can't get my head around it. Can somebody give me a hint? Thanks!

asked Mar 16 '14 by still_st


3 Answers

It depends on which similarity metric you want to use.

Cosine similarity is universally useful and built-in:

sim = gensim.matutils.cossim(vec_lda1, vec_lda2) 

Hellinger distance is useful for similarity between probability distributions (such as LDA topics):

import numpy as np

dense1 = gensim.matutils.sparse2full(lda_vec1, lda.num_topics)
dense2 = gensim.matutils.sparse2full(lda_vec2, lda.num_topics)
sim = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2))**2).sum())
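As a quick numeric illustration of the two metrics (a standalone sketch using made-up toy topic distributions, not actual gensim output), here is what cosine similarity and Hellinger distance look like on two dense topic vectors:

```python
import numpy as np

# Two hypothetical topic distributions (toy values; each sums to 1)
dense1 = np.array([0.7, 0.2, 0.1])
dense2 = np.array([0.6, 0.3, 0.1])

# Cosine similarity: 1.0 means identical direction
cos_sim = dense1.dot(dense2) / (np.linalg.norm(dense1) * np.linalg.norm(dense2))

# Hellinger distance: 0.0 means identical distributions, 1.0 maximally different
hellinger = np.sqrt(0.5 * ((np.sqrt(dense1) - np.sqrt(dense2)) ** 2).sum())

print(round(cos_sim, 3))    # 0.983
print(round(hellinger, 3))  # 0.084
```

Note that cosine is a similarity (higher means more alike) while Hellinger is a distance (lower means more alike), so the two numbers point in opposite directions.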
answered Sep 27 '22 by Radim


I don't know if this will help, but I managed to get good results for document matching and similarity by using the actual document as a query.

dictionary = corpora.Dictionary.load('dictionary.dict')
corpus = corpora.MmCorpus("corpus.mm")
lda = models.LdaModel.load("model.lda") #result from running online lda (training)

index = similarities.MatrixSimilarity(lda[corpus])
index.save("simIndex.index")

docname = "docs/the_doc.txt"
with open(docname, 'r') as f:
    doc = f.read()
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lda = lda[vec_bow]

sims = index[vec_lda]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)

The similarity scores between the query document and every document in the corpus are the second element of each (document_id, score) pair in sims.
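The sorting step above can be illustrated on its own (with hypothetical toy scores, not real gensim output): enumerate pairs each score with its document id, and sorting by the negated score puts the best match first.

```python
# Hypothetical raw similarity scores from index[vec_lda], one per corpus document
sims = [0.12, 0.98, 0.47]

# Pair each score with its document id, then sort best-first
ranked = sorted(enumerate(sims), key=lambda item: -item[1])

print(ranked)  # [(1, 0.98), (2, 0.47), (0, 0.12)]
```

Here document 1 is the closest match; if the query document itself is in the corpus, it will typically appear at the top with a score near 1.0.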

answered Sep 27 '22 by Palisand


The provided answers are good, but they aren't very beginner-friendly. I want to start from training the LDA model and then calculate cosine similarity.

Training model part:

docs = ["latent Dirichlet allocation (LDA) is a generative statistical model", 
        "each document is a mixture of a small number of topics",
        "each document may be viewed as a mixture of various topics"]

# Convert document to tokens
docs = [doc.split() for doc in docs]

# A mapping from token to id in each document
from gensim.corpora import Dictionary
dictionary = Dictionary(docs)

# Representing the corpus as a bag of words
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Training the model
from gensim.models import LdaModel
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10)

For extracting the probability assigned to each topic for a document, there are generally two ways. I show both here:

# Preprocess the test documents the same way as the training documents
test_doc = ["LDA is an example of a topic model",
            "topic modelling refers to the task of identifying topics"]
test_doc = [doc.split() for doc in test_doc]
test_corpus = [dictionary.doc2bow(doc) for doc in test_doc]

# Method 1
from gensim.matutils import cossim
doc1 = model.get_document_topics(test_corpus[0], minimum_probability=0)
doc2 = model.get_document_topics(test_corpus[1], minimum_probability=0)
print(cossim(doc1, doc2))

# Method 2
doc1 = model[test_corpus[0]]
doc2 = model[test_corpus[1]]
print(cossim(doc1, doc2))

output:

#Method 1
0.8279631530869963

#Method 2
0.828066885140262

As you can see, both methods give essentially the same result; the difference is that the probabilities returned by the second method sometimes don't add up to one, as discussed here. For a large corpus, the probability vectors can be obtained by passing the whole corpus at once:

# Method 1
probability_vectors = model.get_document_topics(test_corpus, minimum_probability=0)
# Method 2
probability_vectors = model[test_corpus]

NOTE: The sum of the probabilities assigned to the topics of a document may be slightly higher or lower than 1. That is because of floating-point rounding errors.
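If an exactly-normalized vector matters downstream, one option (a sketch, not part of the answers above) is to renormalize the probabilities so they sum to 1:

```python
# Hypothetical topic probabilities whose sum drifted slightly from 1
probs = [0.31, 0.29, 0.20, 0.2002]

# Divide each probability by the total to renormalize
total = sum(probs)
normalized = [p / total for p in probs]

# After renormalization the sum is 1 up to float precision
print(abs(sum(normalized) - 1.0) < 1e-12)  # True
```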

answered Sep 27 '22 by eng.mrgh