Uni-gram topics
topic1 - scuba, water, vapor, diving
topic2 - dioxide, plants, green, carbon

Bi-gram topics
topic1 - scuba diving, water vapor
topic2 - green plants, carbon dioxide
Any idea?
To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with varying numbers of topics. One way to evaluate goodness-of-fit is to calculate the perplexity of a held-out set of documents: the lower the perplexity, the better the model describes documents it did not see during training.
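For example, a minimal sketch with gensim (the corpus and dictionary names are placeholders, not from the original post):

from gensim.models import LdaModel

# Assumes `train_corpus` and `heldout_corpus` (bag-of-words lists) and a
# gensim `dictionary` already exist -- these are illustrative names.
for k in (5, 10, 20, 40):
    lda = LdaModel(corpus=train_corpus, id2word=dictionary,
                   num_topics=k, passes=10, random_state=0)
    # log_perplexity returns a per-word likelihood bound (log base 2);
    # perplexity = 2 ** (-bound), so lower values indicate a better fit.
    bound = lda.log_perplexity(heldout_corpus)
    print(f"k={k}: held-out perplexity = {2 ** (-bound):.1f}")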
In one respect, LDA applied to text data works like PCA: it decomposes the corpus document-word matrix (the larger matrix) into two smaller matrices, the Document-Topic matrix and the Topic-Word matrix. In that sense LDA, like PCA, can be viewed as a matrix factorization technique.
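A minimal sketch of this factorization view with scikit-learn (the toy documents are illustrative only):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = ["scuba diving in deep water",
         "water vapor and carbon dioxide",
         "green plants absorb carbon dioxide"]

X = CountVectorizer().fit_transform(texts)   # document-word matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)             # Document-Topic matrix
topic_word = lda.components_                 # Topic-Word weights

print(doc_topic.shape)   # (n_documents, n_topics)
print(topic_word.shape)  # (n_topics, n_words)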
My approach to finding the optimal number of topics is to build many LDA models with different numbers of topics (k) and pick the one that gives the highest coherence value. Choosing the 'k' that marks the end of the rapid growth of topic coherence usually yields meaningful and interpretable topics.
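A sketch of that search with gensim's CoherenceModel (assumes tokenized `texts`, a `dictionary`, and a bag-of-words `corpus` already exist; the names are placeholders):

from gensim.models import LdaModel, CoherenceModel

scores = {}
for k in range(2, 21, 2):
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=k, passes=10, random_state=0)
    cm = CoherenceModel(model=lda, texts=texts,
                        dictionary=dictionary, coherence='c_v')
    scores[k] = cm.get_coherence()

# Prefer the k where coherence stops growing rapidly (the elbow)
# over blindly taking the maximum.
for k, score in sorted(scores.items()):
    print(k, round(score, 3))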
The tf-idf score can be very useful for LDA. It can be used to visualize topics or to choose the vocabulary. "It is often computationally expensive to use the entire vocabulary. Choosing the top V words by TFIDF is an effective way to prune the vocabulary".
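A sketch of that pruning step (V, raw_documents, and the ranking criterion are assumptions; here words are ranked by their maximum tf-idf score across the corpus):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

V = 5000  # target vocabulary size (illustrative)

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(raw_documents)   # docs x words, sparse tf-idf matrix
max_scores = np.asarray(X.max(axis=0).todense()).ravel()
words = np.array(tfidf.get_feature_names_out())

top_idx = np.argsort(max_scores)[::-1][:V]
vocabulary = set(words[top_idx])         # restrict the LDA vocabulary to these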
Given I have a dict called docs, containing lists of words from documents, I can turn it into an array of words + bigrams (or also trigrams, etc.) using nltk.util.ngrams or your own function, like this:
from nltk.util import ngrams

for doc in docs:
    # Append underscore-joined bigrams to each document's token list
    docs[doc] = docs[doc] + ["_".join(w) for w in ngrams(docs[doc], 2)]
Then you pass the values of this dict to the LDA model as a corpus. Bigrams joined by underscores are thus treated as single tokens.
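For example, with gensim (a sketch; `docs` is the dict from above):

from gensim import corpora
from gensim.models import LdaModel

dictionary = corpora.Dictionary(docs.values())
corpus = [dictionary.doc2bow(tokens) for tokens in docs.values()]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)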