Sometimes it returns probabilities for all topics and everything is fine, but sometimes it returns probabilities for only a few topics, and they don't add up to one. It seems to depend on the document. Generally, when it returns only a few topics, the probabilities add up to roughly 80%, so is it returning just the most relevant topics? Is there a way to force it to return all probabilities?
Maybe I'm missing something, but I can't find any documentation of the method's parameters.
I had the same problem and solved it by passing the argument minimum_probability=0 when calling the get_document_topics method of gensim.models.ldamodel.LdaModel objects:
topic_assignments = lda.get_document_topics(corpus, minimum_probability=0)
By default, gensim doesn't output probabilities below 0.01, so if any topics for a given document are assigned probabilities under this threshold, the topic probabilities returned for that document will not add up to one.
Here's an example:
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
# Train the model on the corpus.
lda = LdaModel(common_corpus, num_topics=100)
# Try values of minimum_probability argument of None (default) and 0
for minimum_probability in (None, 0):
    # Get topic probabilities for each document
    topic_assignments = lda.get_document_topics(common_corpus, minimum_probability=minimum_probability)
    probabilities = [[entry[1] for entry in doc] for doc in topic_assignments]
    # Print the sum of the returned probabilities for each document
    print(f"Calculating topic probabilities with minimum_probability argument = {minimum_probability}")
    print("Sum of probabilities:")
    for i, P in enumerate(probabilities):
        sum_P = sum(P)
        print(f"\tdoc {i} = {sum_P}")
And the output would be:
Calculating topic probabilities with minimum_probability argument = None
Sum of probabilities:
doc 0 = 0.6733324527740479
doc 1 = 0.8585712909698486
doc 2 = 0.7549994885921478
doc 3 = 0.8019999265670776
doc 4 = 0.7524996995925903
doc 5 = 0
doc 6 = 0
doc 7 = 0
doc 8 = 0.5049992203712463
Calculating topic probabilities with minimum_probability argument = 0
Sum of probabilities:
doc 0 = 1.0000000400468707
doc 1 = 1.0000000337604433
doc 2 = 1.0000000079162419
doc 3 = 1.0000000284053385
doc 4 = 0.9999999937135726
doc 5 = 0.9999999776482582
doc 6 = 0.9999999776482582
doc 7 = 0.9999999776482582
doc 8 = 0.9999999930150807
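The same pattern can be reproduced without gensim. The following is a minimal sketch in plain Python, not gensim's actual code: filter_topics is a hypothetical helper that drops entries below a threshold, and the distribution over 100 topics is made up for illustration.

```python
def filter_topics(distribution, minimum_probability):
    """Keep only (topic_id, probability) pairs at or above the threshold.

    Hypothetical helper that mimics the thresholding described above.
    """
    return [
        (topic_id, p)
        for topic_id, p in enumerate(distribution)
        if p >= minimum_probability
    ]


# Made-up distribution: two dominant topics, with the remaining
# probability mass spread evenly over the other 98 topics.
num_topics = 100
dominant = [0.5, 0.3]
rest = (1.0 - sum(dominant)) / (num_topics - len(dominant))
distribution = dominant + [rest] * (num_topics - len(dominant))

filtered = filter_topics(distribution, 0.01)
unfiltered = filter_topics(distribution, 0)

# With the 0.01 threshold only the dominant topics survive,
# so the returned probabilities no longer add up to one.
print(len(filtered), sum(p for _, p in filtered))
print(len(unfiltered), sum(p for _, p in unfiltered))
```

With a threshold of 0.01 only the two dominant topics remain and their sum is about 0.8, which matches the "roughly 80%" seen in the question. With many topics and a short document, it is plausible that every topic falls under the threshold, which would explain the sums of 0 for docs 5 to 7 above.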
This default behaviour is not stated very clearly in the documentation. The default value of minimum_probability for the get_document_topics method is None; however, this does not set the threshold to zero. Instead, minimum_probability falls back to the minimum_probability of the gensim.models.ldamodel.LdaModel object, which is 0.01 by default, as you can see in the source code:
def __init__(self, corpus=None, num_topics=100, id2word=None,
             distributed=False, chunksize=2000, passes=1, update_every=1,
             alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10,
             iterations=50, gamma_threshold=0.001, minimum_probability=0.01,
             random_state=None, ns_conf=None, minimum_phi_value=0.01,
             per_word_topics=False, callbacks=None, dtype=np.float32):
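The fallback from None to the model-level setting can be sketched as follows. This is an illustration of the behaviour described above, not gensim's implementation: ToyModel and effective_threshold are hypothetical names.

```python
class ToyModel:
    """Hypothetical stand-in for a topic model, not gensim's LdaModel."""

    def __init__(self, minimum_probability=0.01):
        # Model-level threshold, with the same default as in the signature above
        self.minimum_probability = minimum_probability

    def effective_threshold(self, minimum_probability=None):
        # None means "use the model's own setting", not "no threshold".
        # The check must be "is None": 0 is falsy, so "or" would discard it.
        if minimum_probability is None:
            return self.minimum_probability
        return minimum_probability


model = ToyModel()
default_threshold = model.effective_threshold()     # falls back to the model's 0.01
explicit_zero = model.effective_threshold(0)        # an explicit 0 is respected
zero_model = ToyModel(minimum_probability=0).effective_threshold()

print(default_threshold, explicit_zero, zero_model)
```

If the real method behaves as described, passing minimum_probability=0 when constructing the LdaModel should be an alternative to passing it on every get_document_topics call.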