I have read that the most common technique for topic modeling (extracting possible topics from text) is Latent Dirichlet allocation (LDA).
However, I am wondering whether it is a good idea to try topic modeling with Word2Vec, since it clusters words in vector space. Couldn't the clusters therefore be regarded as topics?
Do you think it makes sense to follow this approach for research purposes? In the end, what I am interested in is extracting keywords from text according to topics.
Word2Vec is a method for learning word embeddings (word vectors) from a text corpus. Conceptually, it is a shallow two-layer neural network that scans the corpus and produces a vector representation for each word in the vocabulary.
With LDA, documents are represented as bags of words. Each document gets a distribution over topics, which you can treat as a sort of document embedding, and the contribution of a single word (or the topic distribution of a single-word document) can be interpreted as a crude word embedding.
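As a minimal illustration of that view, here is a toy gensim sketch; the corpus and parameter values are invented for demonstration. It prints a document's topic distribution, and then the topic distribution of a single-word document:

```python
# A toy sketch of the LDA view above, using gensim.
# The corpus and parameter values are made up for demonstration.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["cat", "dog", "pet", "food"],
    ["stock", "market", "trade", "price"],
    ["dog", "leash", "walk", "pet"],
]

dictionary = Dictionary(docs)                    # word <-> integer id mapping
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words representation

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=50, random_state=0)

# Per-document topic distribution: the "document embedding" mentioned above.
print(lda.get_document_topics(corpus[0]))

# Topic distribution of a single-word document: a crude "word embedding".
print(lda.get_document_topics(dictionary.doc2bow(["dog"])))
```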
The Word2Vec model captures notions of relatedness between words, such as semantic similarity, synonym detection, concept categorization, selectional preferences, and analogy.
Thanks to the training tricks introduced with word2vec (hierarchical softmax and negative sampling), it is practical to train over very large vocabularies while still getting embeddings that perform well on downstream tasks, and Python libraries such as gensim make it easy to get started.
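To make the question's proposal concrete, here is a hedged sketch of the cluster-the-embeddings idea: train Word2Vec with gensim, then cluster the word vectors with k-means and read each cluster as a candidate "topic". The toy sentences and hyperparameters are invented for illustration; a real corpus needs far more data:

```python
# A sketch of the question's proposal: train Word2Vec, cluster the
# word vectors, and read each cluster as a candidate "topic".
# Toy sentences and hyperparameters are invented for illustration.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [
    ["cat", "dog", "pet", "food"],
    ["stock", "market", "trade", "price"],
    ["dog", "leash", "walk", "pet"],
    ["market", "price", "invest", "stock"],
]

w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1,
               epochs=200, seed=0)

# Relatedness queries that the embeddings support directly:
print(w2v.wv.most_similar("dog", topn=3))

# Cluster every word vector; each cluster becomes a keyword list per "topic".
words = list(w2v.wv.index_to_key)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(w2v.wv[words])
for k in range(2):
    print(k, [w for w, c in zip(words, labels) if c == k])
```

Note that, unlike LDA, this yields hard word clusters with no per-document topic proportions, which is one motivation for the hybrid models discussed below.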
You might want to look at the following papers:
Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. 2015. Improving Topic Models with Latent Feature Word Representations. Transactions of the Association for Computational Linguistics, vol. 3, pp. 299-313. [CODE]
Yang Liu, Zhiyuan Liu, Tat-Seng Chua and Maosong Sun. 2015. Topical Word Embeddings. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp. 2418-2424. [CODE]
The first paper integrates word embeddings into the LDA model and into the one-topic-per-document DMM model. It reports significant improvements on topic coherence, document clustering and document classification tasks, especially on small corpora or short texts (e.g., tweets).
The second paper is also interesting. It uses LDA to assign a topic to each word, and then employs Word2Vec to learn word embeddings based on both the words and their topics.
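The authors' own code is linked above; purely to illustrate the idea (this is not their implementation, and unlike the paper, which assigns topics per occurrence, this shortcut gives each word type its single most likely topic), one could tag every token with an LDA topic and then run Word2Vec over the tagged tokens:

```python
# A rough illustration of the topical-word-embedding idea, NOT the authors'
# code: tag each token with its most likely LDA topic, then train Word2Vec
# on the tagged tokens so "apple_T0" and "apple_T1" get separate vectors.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

docs = [
    ["apple", "fruit", "juice"],
    ["apple", "iphone", "device"],
    ["fruit", "juice", "orange"],
    ["iphone", "device", "screen"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=100, random_state=0)

# Most likely topic per vocabulary word, from the topic-word matrix.
topic_word = lda.get_topics()        # shape: (num_topics, vocab_size)
word_topic = {dictionary[i]: int(topic_word[:, i].argmax())
              for i in range(len(dictionary))}

# Retokenize with topic tags and learn embeddings over word-topic pairs.
tagged = [[f"{w}_T{word_topic[w]}" for w in doc] for doc in docs]
twe = Word2Vec(tagged, vector_size=50, min_count=1, epochs=200, seed=0)
print(twe.wv.index_to_key)
```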
Two groups have tried to solve this directly.
Chris Moody at Stitch Fix came out with LDA2Vec, and some Ph.D. students at CMU wrote a paper called "Gaussian LDA for Topic Models with Word Embeddings" with code here... though I could not get the Java code there to output sensible results. It's an interesting idea of combining word2vec with Gaussian (actually T-distributions when you work out the math) word-topic distributions. Gaussian LDA should be able to handle words that were out of the vocabulary at training time.
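To see why Gaussian LDA can score unseen words, here is a toy numpy/scipy sketch; this is not the paper's code, and the topic means below are random stand-ins for learned parameters. Each topic is a Gaussian over the embedding space:

```python
# A toy illustration of the Gaussian LDA idea, NOT the paper's code:
# each topic is a Gaussian over the word-embedding space, so any word that
# has a vector can be scored against topics, even if the topic model never
# saw it. The means below are random stand-ins for learned parameters.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
dim, n_topics = 5, 2

topic_means = rng.normal(size=(n_topics, dim))  # learned in the real model
topic_cov = np.eye(dim)                         # likewise learned

def topic_posterior(word_vec, topic_prior):
    """P(topic | word) proportional to P(topic) * N(word_vec; mean_k, cov_k)."""
    likelihoods = np.array([multivariate_normal.pdf(word_vec, topic_means[k], topic_cov)
                            for k in range(n_topics)])
    posterior = topic_prior * likelihoods
    return posterior / posterior.sum()

# An out-of-vocabulary word still receives a topic distribution, because
# only its embedding is needed to evaluate the Gaussians.
oov_vec = rng.normal(size=dim)
print(topic_posterior(oov_vec, np.array([0.5, 0.5])))
```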
LDA2Vec attempts to train both the LDA model and the word vectors at the same time, and it also allows you to put LDA priors over non-words to get really interesting results.
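Here is a conceptual numpy sketch of that joint objective as I understand it; this is not Moody's implementation, and all tensors are random stand-ins for learned parameters. The context vector used for skip-gram-style prediction is the pivot word vector plus a document vector that is itself a softmax mixture of topic vectors:

```python
# A conceptual numpy sketch of the LDA2Vec objective, NOT Moody's
# implementation: word vectors, topic vectors and per-document topic
# weights all train against the same skip-gram-style prediction loss.
# All tensors here are random stand-ins for learned parameters.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, n_topics = 100, 20, 5

word_vecs = rng.normal(size=(vocab_size, dim))   # learned word vectors
topic_vecs = rng.normal(size=(n_topics, dim))    # learned topic vectors
doc_weights = rng.normal(size=n_topics)          # one document's topic logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Document vector: a mixture of topic vectors (the "LDA" half).
doc_vec = softmax(doc_weights) @ topic_vecs

# Context vector for one pivot word in this document (the "word2vec" half).
pivot, target = 3, 7
context = word_vecs[pivot] + doc_vec

# Skip-gram-style loss: predict the nearby target word from the context.
loss = -np.log(softmax(word_vecs @ context)[target])
print(loss)
```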