
How to predict the topic of a new query using a trained LDA model using gensim?

I have trained an LDA topic model on a corpus using gensim.

Going through the tutorial on the gensim website (this is not the whole code):

import re
from gensim import corpora

# `punctuation_string`, `stoplist`, and the trained `lda` model are defined
# earlier in the code (not shown here).

question = 'Changelog generation from Github issues?'

temp = question.lower()
for ch in punctuation_string:
    temp = temp.replace(ch, '')

words = re.findall(r'\w+', temp, flags=re.UNICODE)
important_words = [w for w in words if w not in stoplist]
print(important_words)

dictionary = corpora.Dictionary.load('questions.dict')
ques_vec = dictionary.doc2bow(important_words)

print(dictionary)
print(ques_vec)
print(lda[ques_vec])

This is the output that I get:

['changelog', 'generation', 'github', 'issues']
Dictionary(15791 unique tokens)
[(514, 1), (3625, 1), (3626, 1), (3627, 1)]
[(4, 0.20400000000000032), (11, 0.20400000000000032), (19, 0.20263215848547525), (29, 0.20536784151452539)]

I don't know how the last output is going to help me find the most likely topic for the question!

Please help!

Animesh Pandey asked Apr 28 '13


People also ask

How do you pick the number of topics k when you run an LDA topic model?

Method 1: Try out different values of k and select the one that has the largest likelihood. Method 2: Run HDP-LDA on your corpus and let it infer the number of topics directly. Method 3: If HDP-LDA is infeasible on your corpus (because of corpus size), then take a uniform sample of your corpus, run HDP-LDA on that, and take the value of k given by HDP-LDA.
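A minimal sketch of the HDP route in gensim, assuming `corpus` and `dictionary` come from the same pipeline as in the question; the 1% weight cut-off is a hypothetical heuristic, not a standard value:

from gensim.models import HdpModel

hdp = HdpModel(corpus, id2word=dictionary)

# hdp_to_lda() returns the topic weights (alpha) and topic-word
# distributions (beta) of the equivalent LDA model; counting the topics
# that carry non-negligible weight gives a rough estimate of k.
alpha, beta = hdp.hdp_to_lda()
k_estimate = int((alpha > 0.01 * alpha.sum()).sum())  # 1% cut-off is a heuristic
print(k_estimate)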

How do you determine the number of topics in topic modeling?

To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with varying numbers of topics. You can evaluate the goodness-of-fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents.
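A minimal sketch of that comparison in gensim, assuming `corpus` and `dictionary` come from the same pipeline as in the question; the split point and the candidate values of k are arbitrary:

from gensim.models import LdaModel

held_out = corpus[:100]   # hypothetical held-out split
training = corpus[100:]

for k in (10, 20, 30, 40):
    lda_k = LdaModel(training, id2word=dictionary, num_topics=k, passes=5)
    # log_perplexity returns the per-word likelihood bound on the held-out
    # documents; a higher bound (i.e. lower perplexity) is better.
    print(k, lda_k.log_perplexity(held_out))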

How does LDA work for topic models?

LDA is applied to text data. Viewed at a high level, it decomposes the corpus's document-word matrix (the larger matrix) into two smaller matrices: the document-topic matrix and the topic-word matrix. In that sense LDA resembles matrix factorization techniques such as PCA, although unlike PCA it is a probabilistic generative model rather than a linear projection.
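As a rough illustration of those two matrices in recent gensim versions (the `lda` model and `corpus` variable names are assumed from the question):

import numpy as np

# Topic-word matrix: one row per topic, one column per vocabulary term.
topic_word = lda.get_topics()                 # shape (num_topics, vocab_size)

# Document-topic matrix: one row per document, one column per topic.
doc_topic = np.zeros((len(corpus), lda.num_topics))
for d, bow in enumerate(corpus):
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        doc_topic[d, topic_id] = prob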


1 Answer

I have written a function in Python that gives the most likely topic for a new query:

import re
import numpy
from gensim import corpora

def getTopicForQuery(question):
    # `punctuation_string`, `stoplist`, and the trained `lda` model are
    # assumed to be in scope, as in the question.
    temp = question.lower()
    for ch in punctuation_string:
        temp = temp.replace(ch, '')

    # Tokenize and drop stop words.
    words = re.findall(r'\w+', temp, flags=re.UNICODE)
    important_words = [w for w in words if w not in stoplist]

    # Load the dictionary built from our own corpus and convert the
    # query to a bag of words.
    dictionary = corpora.Dictionary.load('questions.dict')
    ques_vec = dictionary.doc2bow(important_words)

    # Topic probability distribution of the query: a list of
    # (topic_id, probability) pairs.
    topic_vec = lda[ques_vec]

    # Put the pairs into an array and sort by probability, descending.
    word_count_array = numpy.empty((len(topic_vec), 2), dtype=object)
    for i, (topic_id, prob) in enumerate(topic_vec):
        word_count_array[i, 0] = topic_id
        word_count_array[i, 1] = prob

    idx = numpy.argsort(word_count_array[:, 1])[::-1]
    word_count_array = word_count_array[idx]

    # Top word of the most probable topic; print_topic returns a string
    # in the format "probability*word".
    final = lda.print_topic(word_count_array[0, 0], 1)
    question_topic = final.split('*')

    return question_topic[1]

Before going through this, do refer to this link!

In the initial part of the code, the query is pre-processed to strip out stop words and unnecessary punctuation.

Then, the dictionary that was built from our own corpus is loaded.

We then convert the tokens of the new query to a bag of words, and the topic probability distribution of the query is calculated by topic_vec = lda[ques_vec], where lda is the trained model as explained in the link referred to above.

The distribution is then sorted with respect to the probabilities of the topics, and the top word of the most probable topic is returned by question_topic[1].
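For illustration, here is how the function might be called, along with an alternative that skips the manual sort; this sketch assumes the same trained lda model, stoplist, punctuation_string, and questions.dict file as above:

from gensim import corpora

# Hypothetical usage, with `lda`, `stoplist`, and `punctuation_string` in scope.
print(getTopicForQuery('Changelog generation from Github issues?'))

# max() picks the most probable (topic_id, probability) pair directly.
dictionary = corpora.Dictionary.load('questions.dict')
ques_vec = dictionary.doc2bow(['changelog', 'generation', 'github', 'issues'])
best_topic, best_prob = max(lda[ques_vec], key=lambda pair: pair[1])
print(lda.print_topic(best_topic, 5))  # five most representative words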

Animesh Pandey answered Sep 20 '22