I have read that the most common technique for topic modeling (extracting possible topics from text) is Latent Dirichlet allocation (LDA).
However, I am wondering whether it is a good idea to try topic modeling with Word2Vec, since it clusters words in vector space. Couldn't the clusters therefore be regarded as topics?
Do you think it makes sense to follow this approach for research purposes? In the end, what I am interested in is extracting keywords from text according to topics.
Word2Vec is a method for learning word embeddings (word vectors) from a text corpus. Conceptually, it is a shallow two-layer neural network that scans the corpus and produces a vector representation for each word in the vocabulary.
With LDA, documents are represented as bags of words. Each document gets a distribution over topics, which you can treat as a sort of document embedding, and the contribution of a single word (or the topic distribution of a single-word document) can be interpreted as a crude word embedding.
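As a minimal illustration of that view, here is a toy gensim sketch; the corpus and parameter values are invented for demonstration. It prints a document's topic distribution, and then the topic distribution of a single-word document:

```python
# A toy sketch of the LDA view above, using gensim.
# The corpus and parameter values are made up for demonstration.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["cat", "dog", "pet", "food"],
    ["stock", "market", "trade", "price"],
    ["dog", "leash", "walk", "pet"],
]

dictionary = Dictionary(docs)                    # word <-> integer id mapping
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words representation

lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=50, random_state=0)

# Per-document topic distribution: the "document embedding" mentioned above.
print(lda.get_document_topics(corpus[0]))

# Topic distribution of a single-word document: a crude "word embedding".
print(lda.get_document_topics(dictionary.doc2bow(["dog"])))
```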
The Word2Vec model captures notions of relatedness between words, such as semantic similarity, synonym detection, concept categorization, selectional preferences, and analogy.
Thanks to the training tricks introduced with word2vec (hierarchical softmax and negative sampling), it is practical to train over very large vocabularies while still getting embeddings that perform well on downstream tasks, and Python libraries such as gensim make it easy to get started.
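To make the question's proposal concrete, here is a hedged sketch of the cluster-the-embeddings idea: train Word2Vec with gensim, then cluster the word vectors with k-means and read each cluster as a candidate "topic". The toy sentences and hyperparameters are invented for illustration; a real corpus needs far more data:

```python
# A sketch of the question's proposal: train Word2Vec, cluster the
# word vectors, and read each cluster as a candidate "topic".
# Toy sentences and hyperparameters are invented for illustration.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

sentences = [
    ["cat", "dog", "pet", "food"],
    ["stock", "market", "trade", "price"],
    ["dog", "leash", "walk", "pet"],
    ["market", "price", "invest", "stock"],
]

w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1,
               epochs=200, seed=0)

# Relatedness queries that the embeddings support directly:
print(w2v.wv.most_similar("dog", topn=3))

# Cluster every word vector; each cluster becomes a keyword list per "topic".
words = list(w2v.wv.index_to_key)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(w2v.wv[words])
for k in range(2):
    print(k, [w for w, c in zip(words, labels) if c == k])
```

Note that, unlike LDA, this yields hard word clusters with no per-document topic proportions, which is one motivation for the hybrid models discussed below.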
You might want to look at the following papers:
Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. 2015. Improving Topic Models with Latent Feature Word Representations. Transactions of the Association for Computational Linguistics, vol. 3, pp. 299-313. [CODE]
Yang Liu, Zhiyuan Liu, Tat-Seng Chua and Maosong Sun. 2015. Topical Word Embeddings. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pp. 2418-2424. [CODE]
The first paper integrates word embeddings into the LDA model and into the one-topic-per-document DMM model. It reports significant improvements on topic coherence, document clustering and document classification tasks, especially on small corpora or short texts (e.g., tweets).
The second paper is also interesting. It uses LDA to assign a topic to each word, and then employs Word2Vec to learn word embeddings based on both the words and their topics.
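The authors' own code is linked above; purely to illustrate the idea (this is not their implementation, and unlike the paper, which assigns topics per occurrence, this shortcut gives each word type its single most likely topic), one could tag every token with an LDA topic and then run Word2Vec over the tagged tokens:

```python
# A rough illustration of the topical-word-embedding idea, NOT the authors'
# code: tag each token with its most likely LDA topic, then train Word2Vec
# on the tagged tokens so "apple_T0" and "apple_T1" get separate vectors.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

docs = [
    ["apple", "fruit", "juice"],
    ["apple", "iphone", "device"],
    ["fruit", "juice", "orange"],
    ["iphone", "device", "screen"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=100, random_state=0)

# Most likely topic per vocabulary word, from the topic-word matrix.
topic_word = lda.get_topics()        # shape: (num_topics, vocab_size)
word_topic = {dictionary[i]: int(topic_word[:, i].argmax())
              for i in range(len(dictionary))}

# Retokenize with topic tags and learn embeddings over word-topic pairs.
tagged = [[f"{w}_T{word_topic[w]}" for w in doc] for doc in docs]
twe = Word2Vec(tagged, vector_size=50, min_count=1, epochs=200, seed=0)
print(twe.wv.index_to_key)
```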
Two groups have tried to solve this directly.
Chris Moody at Stitch Fix came out with LDA2Vec, and some Ph.D. students at CMU wrote a paper called "Gaussian LDA for Topic Models with Word Embeddings" with code here... though I could not get the Java code there to output sensible results. It's an interesting idea of combining word2vec with Gaussian (actually T-distributions when you work out the math) word-topic distributions. Gaussian LDA should be able to handle words that were out of the vocabulary at training time.
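To see why Gaussian LDA can score unseen words, here is a toy numpy/scipy sketch; this is not the paper's code, and the topic means below are random stand-ins for learned parameters. Each topic is a Gaussian over the embedding space:

```python
# A toy illustration of the Gaussian LDA idea, NOT the paper's code:
# each topic is a Gaussian over the word-embedding space, so any word that
# has a vector can be scored against topics, even if the topic model never
# saw it. The means below are random stand-ins for learned parameters.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
dim, n_topics = 5, 2

topic_means = rng.normal(size=(n_topics, dim))  # learned in the real model
topic_cov = np.eye(dim)                         # likewise learned

def topic_posterior(word_vec, topic_prior):
    """P(topic | word) proportional to P(topic) * N(word_vec; mean_k, cov_k)."""
    likelihoods = np.array([multivariate_normal.pdf(word_vec, topic_means[k], topic_cov)
                            for k in range(n_topics)])
    posterior = topic_prior * likelihoods
    return posterior / posterior.sum()

# An out-of-vocabulary word still receives a topic distribution, because
# only its embedding is needed to evaluate the Gaussians.
oov_vec = rng.normal(size=dim)
print(topic_posterior(oov_vec, np.array([0.5, 0.5])))
```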
LDA2Vec attempts to train both the LDA model and the word vectors at the same time, and it also allows you to put LDA priors over non-words to get really interesting results.
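Here is a conceptual numpy sketch of that joint objective as I understand it; this is not Moody's implementation, and all tensors are random stand-ins for learned parameters. The context vector used for skip-gram-style prediction is the pivot word vector plus a document vector that is itself a softmax mixture of topic vectors:

```python
# A conceptual numpy sketch of the LDA2Vec objective, NOT Moody's
# implementation: word vectors, topic vectors and per-document topic
# weights all train against the same skip-gram-style prediction loss.
# All tensors here are random stand-ins for learned parameters.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, n_topics = 100, 20, 5

word_vecs = rng.normal(size=(vocab_size, dim))   # learned word vectors
topic_vecs = rng.normal(size=(n_topics, dim))    # learned topic vectors
doc_weights = rng.normal(size=n_topics)          # one document's topic logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Document vector: a mixture of topic vectors (the "LDA" half).
doc_vec = softmax(doc_weights) @ topic_vecs

# Context vector for one pivot word in this document (the "word2vec" half).
pivot, target = 3, 7
context = word_vecs[pivot] + doc_vec

# Skip-gram-style loss: predict the nearby target word from the context.
loss = -np.log(softmax(word_vecs @ context)[target])
print(loss)
```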