Topic modelling, but with known topics?

Okay, so usually topic models (such as LDA, pLSI, etc.) are used to infer topics that may be present in a set of documents, in an unsupervised fashion. I would like to know if anyone has ideas on how I can shoehorn my problem into an LDA framework, since there are very good tools available for solving LDA problems.

For the sake of being thorough, I have the following pieces of information as input:

  • A set of documents (segments of DNA from one organism, where each segment is a document)
    • A document can only have one topic in this scenario
  • A set of topics (segments of DNA from other organisms)
  • Words in this case are triplets of bases (for now)

The question I want to answer is: For the current document, what is its topic? In other words, for the given DNA segment, which other organism (same species) did it most likely come from? There could have been mutations and such since the exchange of segments occurred, so the two segments won't be identical.

The main difference between this and the classical LDA model is that I know the topics ahead of time.
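
To make the setup concrete: since the topics are known and each document has exactly one, the problem reduces to scoring each document under each topic's word distribution and picking the best. Below is a minimal sketch of that reduction (the names, the Laplace smoothing, and the choice of overlapping triplets are all mine, not from any particular tool):

```python
import math
from collections import Counter

def to_triplets(seq):
    """Tokenize a DNA segment into overlapping base triplets ('words')."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

def triplet_log_probs(topic_seqs, alpha=1.0):
    """Laplace-smoothed log P(word | topic) for each known topic (organism)."""
    vocab = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
    dists = {}
    for topic, seqs in topic_seqs.items():
        counts = Counter(t for s in seqs for t in to_triplets(s))
        total = sum(counts.values()) + alpha * len(vocab)
        dists[topic] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return dists

def best_topic(document_seq, dists):
    """Maximum-likelihood topic, assuming the segment uses only A/C/G/T."""
    words = to_triplets(document_seq)
    return max(dists, key=lambda z: sum(dists[z][w] for w in words))

# Hypothetical toy data: segments from two candidate source organisms.
topics = {"organism_A": ["ACGTACGTGG"], "organism_B": ["TTTTGGGACC"]}
print(best_topic("ACGTACGAGG", triplet_log_probs(topics)))  # -> organism_A
```

Mutations then just look like noise that the smoothing absorbs, as long as enough triplets still match the true source.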

My initial idea was to take a pLSA model (http://en.wikipedia.org/wiki/PLSA), set the topic nodes explicitly, and then perform standard EM learning (if only there were a decent library that could handle Bayesian parameter learning with latent variables...), followed by inference with whatever algorithm is convenient (which shouldn't matter much, because the model is a polytree anyway).
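
For what it's worth, with P(w|z) clamped to the known topics, those pLSA EM updates collapse to estimating only the per-document topic mixture P(z|d), which is simple enough to write by hand. A minimal numpy sketch under that assumption (array names and toy numbers are illustrative):

```python
import numpy as np

def infer_topic_mixture(doc_counts, p_w_given_z, n_iter=50):
    """pLSA EM with P(w|z) held fixed, so only P(z|d) is learned.

    doc_counts:  (V,) word counts for a single document
    p_w_given_z: (K, V) known per-topic word distributions (rows sum to 1)
    """
    K = p_w_given_z.shape[0]
    p_z = np.full(K, 1.0 / K)                    # start from a uniform P(z|d)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w) ∝ P(z|d) · P(w|z), shape (K, V)
        resp = p_z[:, None] * p_w_given_z
        resp /= resp.sum(axis=0, keepdims=True) + 1e-12
        # M-step: P(z|d) ∝ Σ_w n(d,w) · P(z|d,w)
        p_z = resp @ doc_counts
        p_z /= p_z.sum()
    return p_z

# Toy example: two known topics over a four-word vocabulary.
p_w_given_z = np.array([[0.7, 0.1, 0.1, 0.1],
                        [0.1, 0.1, 0.1, 0.7]])
doc = np.array([8.0, 1.0, 1.0, 0.0])             # counts dominated by word 0
print(infer_topic_mixture(doc, p_w_given_z))     # heavily weights topic 0
```

Since each document is restricted to a single topic here, the argmax of the returned mixture is the answer; this is essentially pLSA's folding-in step with the topics held fixed.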

Edit: I think I've solved it, for anyone who might stumble across this. It turns out you can use Labeled LDA and just assign every label to every document. Since each label has a one-to-one correspondence with a topic, you're effectively telling the algorithm: for each document, choose the topic from this given set of topics (the label set), instead of making up your own.
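
One concrete way to apply that trick is a Labeled LDA implementation such as tomotopy's LLDAModel. The sketch below is written from memory and not verified against a specific tomotopy version, so treat the signatures as assumptions and check the current docs; each known topic segment is a training document carrying its own label, and a query segment is inferred without labels, which lets it draw on the full label/topic set:

```python
# Hedged sketch, assuming tomotopy's Labeled LDA model (LLDAModel).
import tomotopy as tp

organisms = {"organism_A": ["ACG", "CGT", "GTA"],   # topic segments as triplet 'words'
             "organism_B": ["TTT", "TTG", "TGG"]}

mdl = tp.LLDAModel()
# One training doc per known topic, labeled with that topic only, so each
# label/topic stays anchored to its organism's triplets.
for label, words in organisms.items():
    mdl.add_doc(words, labels=[label])
mdl.train(500)

# Infer a new segment's distribution over the full label set.
query = mdl.make_doc(["ACG", "CGT", "GTA"])
topic_dist, _ = mdl.infer(query)
labels = [mdl.topic_label_dict[i] for i in range(len(mdl.topic_label_dict))]
print(max(zip(labels, topic_dist), key=lambda p: p[1]))  # most likely organism
```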

asked May 28 '13 by user1871183


1 Answer

I have a similar problem, and thought I'd add the solutions I'm going with, for completeness.

  • I also have a set of documents (PDF documents anywhere from 1 to 200 pages), though mine are regular English text data.
  • A set of known topics (mine include subtopics, but I won't address that here). Unlike the previous example, I may desire multiple topic labels.
  • Words (standard English, though named entities and acronyms are included in my corpus)

LDA-esque approach: Guided LDA

Guided LDA lets you seed words for your LDA topics. If you have n topics for your final decisions, you just create your GuidedLDA model with n seed topics, each of which contains the keywords that make up its topic name. E.g., I want to cluster into the known topics "biochemistry" and "physics", so I seed my GuidedLDA with d = {0: ['biochemistry'], 1: ['physics']}. You can incorporate other guiding words if you can identify them; the GuidedLDA implementation I'm using (the Python version) makes it relatively easy to identify the top n words for a given topic. You can run GuidedLDA once with only basic seed words, then use the top-words output to find more words to add to each topic. These top words are also potentially helpful for the other approach below.
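
A short sketch of that workflow, assuming the guidedlda Python package's documented interface (the toy corpus, seed_confidence, and iteration counts are placeholders):

```python
import numpy as np
import guidedlda  # the Python GuidedLDA implementation referenced above
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the real documents.
docs = ["enzyme kinetics and protein folding in biochemistry",
        "quantum particle physics and general relativity"]
vec = CountVectorizer()
X = vec.fit_transform(docs).toarray()        # (n_docs, n_words) count matrix
vocab = vec.get_feature_names_out()
word2id = {w: i for i, w in enumerate(vocab)}

# Seed each known topic with the keywords that make up its name.
seed_topic_list = [["biochemistry"], ["physics"]]
seed_topics = {word2id[w]: t for t, words in enumerate(seed_topic_list)
               for w in words if w in word2id}

model = guidedlda.GuidedLDA(n_topics=2, n_iter=100, random_state=7, refresh=20)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)

# Top n words per topic -- candidates for extra seed words on a second run.
n_top = 5
for t, dist in enumerate(model.topic_word_):
    top = np.array(vocab)[np.argsort(dist)][:-(n_top + 1):-1]
    print(f"topic {t}: {' '.join(top)}")
```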

Non-LDA-esque approach: ~k-NN

What I've ended up doing is using a word embedding model (word2vec has been superior to the alternatives in my case) to create a "topic vector" for every topic, based on the words that make up the topic/subtopic name. E.g., I have a category Biochemistry with a subcategory Molecular Biology; the most basic topic vector is just the word2vec vectors for Biochemistry, Molecular, and Biology averaged together. For every document I want to assign a topic to, I turn it into a "document vector" (same dimension and embedding model as my topic vectors; just averaging all the word2vec vectors in the doc, after a bit of preprocessing like removing stopwords, has been the best solution for me so far). Then I find the k closest topic vectors to the input document vector. Note that there's some ability to hand-tune this by changing the words that make up the topic vectors; one way to identify further keywords is the GuidedLDA model mentioned above.
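
A minimal sketch of that pipeline with gensim's word2vec (gensim 4.x API assumed; the toy corpus, topic names, and k are placeholders):

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus; in practice, train on your own corpus or load pretrained vectors.
sentences = [["protein", "enzyme", "molecular", "biology", "biochemistry"],
             ["quantum", "relativity", "particle", "physics"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, seed=1)

def avg_vector(words, wv):
    """Average the embeddings of the in-vocabulary words (after preprocessing)."""
    vecs = [wv[w] for w in words if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

# One topic vector per known (sub)topic, from the words in its name.
topic_words = {"Biochemistry/Molecular Biology": ["biochemistry", "molecular", "biology"],
               "Physics": ["physics"]}
topic_vecs = {t: avg_vector(ws, model.wv) for t, ws in topic_words.items()}

def k_closest_topics(doc_words, topic_vecs, wv, k=1):
    """Rank topics by cosine similarity between document and topic vectors."""
    d = avg_vector(doc_words, wv)
    sims = {t: float(d @ v / (np.linalg.norm(d) * np.linalg.norm(v) + 1e-12))
            for t, v in topic_vecs.items()}
    return sorted(sims, key=sims.get, reverse=True)[:k]

print(k_closest_topics(["enzyme", "protein", "molecular"], topic_vecs, model.wv))
```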

I would note that when I tested these two solutions on a different corpus with labeled data (which I used only to evaluate accuracy and the like), the ~k-NN approach proved better than the GuidedLDA approach.

answered Oct 01 '22 by Evan Mata