I'm new to topic modelling / Latent Dirichlet Allocation and have trouble understanding how I can apply the concept to my dataset (or whether it's the correct approach).
I have a small number of literary texts (novels) and would like to extract some general topics using LDA.
I'm using the gensim module in Python along with some nltk features. For a test I've split up my original texts (just 6) into 30 chunks of 1000 words each. Then I converted the chunks into document-term matrices and ran the algorithm. This is the code (although I think it doesn't matter for the question):
# chunks is a list of 30 chunks, each a list of 1000 word tokens
dictionary = gensim.corpora.dictionary.Dictionary(chunks)
corpus = [dictionary.doc2bow(chunk) for chunk in chunks]
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                      num_topics=10)
topics = lda.show_topics(5, 5)
However, the result is completely different from any example I've seen, in that the topics are full of meaningless words that can be found in all source documents, e.g. "I", "he", "said", "like", ... For example:
[(2, '0.009*"I" + 0.007*"\'s" + 0.007*"The" + 0.005*"would" + 0.004*"He"'),
(8, '0.012*"I" + 0.010*"He" + 0.008*"\'s" + 0.006*"n\'t" + 0.005*"The"'),
(9, '0.022*"I" + 0.014*"\'s" + 0.009*"``" + 0.007*"\'\'" + 0.007*"like"'),
(7, '0.010*"\'s" + 0.009*"I" + 0.006*"He" + 0.005*"The" + 0.005*"said"'),
(1, '0.009*"I" + 0.009*"\'s" + 0.007*"n\'t" + 0.007*"The" + 0.006*"He"')]
I don't quite understand why that happens, or why it doesn't happen with the examples I've seen. How do I get the LDA model to find more distinctive topics with less overlap? Is it a matter of filtering out more common words first? How can I adjust how many times the model runs? Is the number of original texts too small?
LDA is extremely dependent on the words used in a corpus and how frequently they show up. The words you are seeing are all stopwords - meaningless words that are the most frequent words in a language, e.g. "the", "I", "a", "if", "for", "said", etc. Since these words are the most frequent, they negatively impact the model.
I would use the nltk stopword corpus to filter out these words:
from nltk.corpus import stopwords
stop_words = stopwords.words('english') # requires the nltk stopwords corpus (nltk.download('stopwords'))
Then make sure your text does not contain any of the words in the stop_words list (with whatever preprocessing method you are using) - an example is below:
text = text.lower() # lowercase first, so capitalised stopwords like "I", "The", "He" are also caught
text = text.split() # split words by space and convert to list
text = [word for word in text if word not in stop_words]
text = ' '.join(text) # join the words back into a continuous string
You may also want to remove punctuation and other characters ("/", "-", etc.); for that you can use regular expressions:
import re
remove_punctuation_regex = re.compile(r"[^A-Za-z ]") # regex for all characters that are NOT A-Z, a-z and space " "
text = re.sub(remove_punctuation_regex, "", text) # sub all non alphabetical characters with empty string ""
Finally, you may also want to filter on the most frequent or least frequent words in your corpus, which you can do using nltk:
from nltk import FreqDist
all_words = text.split() # list of all the words in your corpus
fdist = FreqDist(all_words) # a frequency distribution of words (word count over the corpus)
k = 10000 # say you want to see the top 10,000 words
top_k_words, _ = zip(*fdist.most_common(k)) # unzip the words and word count tuples
print(top_k_words) # print the words and inspect them to see which ones you want to keep and which ones you want to disregard
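If you are building the corpus with gensim anyway, you can also apply this kind of frequency filtering directly on the Dictionary via filter_extremes - a minimal sketch, with purely illustrative thresholds (tune no_below / no_above / keep_n for your corpus):
import gensim

# chunks: the same list of tokenised documents from the question
dictionary = gensim.corpora.Dictionary(chunks)

# drop tokens appearing in fewer than 5 chunks or in more than 50% of them,
# then keep at most the 10,000 most frequent remaining tokens
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=10000)

corpus = [dictionary.doc2bow(chunk) for chunk in chunks]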
That should get rid of the stopwords and extra characters, but still leaves the vast problem of topic modelling (which I won't try to explain here but will leave some tips and links).
Assuming you know a little bit about topic modelling, let's start. LDA is a bag-of-words model, meaning word order doesn't matter. The model assigns a topic distribution (over a predetermined number of topics K) to each document, and a word distribution to each topic. A very insightful high-level video explains this here. If you want to see more of the mathematics, but still at an accessible level, check out this video. The more documents the better, and usually longer documents (with more words) also fare better with LDA - this paper shows that LDA doesn't perform well with short texts (less than ~20 words). K is up to you to choose, and really depends on your corpus of documents (how large it is, what different topics it covers, etc.). Usually a good value of K is between 100-300, but again this really depends on your corpus.
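On the "how many times the model runs" part of the question: gensim's LdaModel has passes (full sweeps over the corpus) and iterations (per-document inference steps), and raising them tends to give more stable topics on a small corpus. A minimal sketch, reusing the dictionary and corpus built above (the numbers are just a starting point, not recommended values):
import gensim

lda = gensim.models.ldamodel.LdaModel(
    corpus=corpus,        # bag-of-words corpus from the snippet above
    id2word=dictionary,
    num_topics=10,        # K - choose this for your corpus
    passes=20,            # number of full passes over the corpus
    iterations=400,       # max inference iterations per document
    random_state=42,      # makes the topics reproducible between runs
)
print(lda.show_topics(num_topics=10, num_words=5))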
LDA has two hyperparameters, alpha and beta (alpha and eta in gensim) - a higher alpha means each text will be represented by more topics (so naturally a lower alpha means each text will be represented by fewer topics). A high eta means each topic is represented by more words, and a low eta means each topic is represented by fewer words - so with a low eta you would get less "overlap" between topics.
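In gensim you can pass these straight to LdaModel; 'auto' lets the model learn the priors from the data, or you can give explicit values (the numbers below are only illustrative):
# let gensim learn asymmetric priors from the corpus
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                      num_topics=10, alpha='auto', eta='auto')

# or set them explicitly: a low alpha (fewer topics per document)
# and a low eta (fewer dominant words per topic, i.e. less overlap)
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary,
                                      num_topics=10, alpha=0.1, eta=0.01)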
There are many insights you can gain using LDA:
- What are the topics in a corpus (naming topics may not matter to your application, but if it does this can be done by inspecting the words in a topic as you have done above)
- What words contribute most to a topic
- What documents in the corpus are most similar (using a similarity metric) - see the sketch below
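A rough sketch of the last two points, using gensim's own APIs (variable names follow the earlier snippets):
from gensim import similarities

# words that contribute most to topic 0
print(lda.show_topic(0, topn=10))

# topic distribution of a single document
print(lda.get_document_topics(corpus[0]))

# which documents are most similar, measured by cosine similarity in topic space
index = similarities.MatrixSimilarity(lda[corpus], num_features=lda.num_topics)
sims = index[lda[corpus[0]]]  # similarity of document 0 to every document
print(sorted(enumerate(sims), key=lambda x: -x[1])[:5])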
Hope this has helped. I was new to LDA a few months ago but I've quickly gotten up to speed using stackoverflow and youtube!