Bear with me as I am extremely new to this and working on a project for a course in a certificate program.
I have a .csv dataset that I obtained by retrieving bibliometric records from the PubMed and Embase databases. There are 1034 rows and several columns, but I am trying to create topic models from just one column, the Abstract column; some records do not have an abstract. I've done some processing (removing stopwords, punctuation, etc.) and have been able to make a bar plot of words occurring more than 200 times, create a frequent-term list by rank, and run word associations with selected words. So it seems R is seeing the words themselves in the Abstract field. My issue comes when I try to create topic models using the topicmodels package. Here's the bit of code I'm using.
#including 1st 3 lines for reference
options(header = FALSE, stringsAsFactors = FALSE, FileEncoding = "latin1")
records <- read.csv("Combined.csv")
AbstractCorpus <- Corpus(VectorSource(records$Abstract))
AbstractTDM <- TermDocumentMatrix(AbstractCorpus)
library(topicmodels)
library(lda)
lda <- LDA(AbstractTDM, k = 8)
(term <- terms(lda, 6))
term <- (apply(term, MARGIN = 2, paste, collapse = ","))
However, the output of topics I get is the following.
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
[1,] "499" "733" "390" "833" "17" "413" "719" "392"
[2,] "484" "655" "808" "412" "550" "881" "721" "61"
[3,] "857" "299" "878" "909" "15" "258" "47" "164"
[4,] "491" "672" "313" "1028" "126" "55" "375" "987"
[5,] "734" "430" "405" "102" "13" "193" "83" "588"
[6,] "403" "52" "489" "10" "598" "52" "933" "980"
Why am I not seeing words here rather than numbers?
Furthermore, the following code, which I basically took from the R PDF on topicmodels, does produce values for me, but the topics are still numbers rather than words, which is meaningless to me.
#using information from topicmodels paper
library(tm)
library(topicmodels)
library(lda)
AbstractTM <- list(
  VEM = LDA(AbstractTDM, k = 10, control = list(seed = 505)),
  VEM_fixed = LDA(AbstractTDM, k = 10, control = list(estimate.alpha = FALSE, seed = 505)),
  Gibbs = LDA(AbstractTDM, k = 10, method = "Gibbs",
    control = list(seed = 505, burnin = 100, thin = 10, iter = 100)),
  CTM = CTM(AbstractTDM, k = 10,
    control = list(seed = 505, var = list(tol = 10^-4), em = list(tol = 10^-3)))
)
#To compare the fitted models we first investigate the α values of the models fitted with VEM and α estimated and with VEM and α fixed
sapply(AbstractTM[1:2], slot, "alpha")
#Find entropy
sapply(AbstractTM, function(x) mean(apply(posterior(x)$topics, 1, function(z) -sum(z * log(z)))))
#Find estimated topics and terms
Topic <- topics(AbstractTM[["VEM"]], 1)
Topic
#find 5 most frequent terms for each topic
Terms <- terms(AbstractTM[["VEM"]], 5)
Terms[,1:5]
Any thoughts on what the issue might be?
LDA (short for Latent Dirichlet Allocation) is an unsupervised machine-learning model that takes documents as input and finds topics as output. The model also estimates what proportion of each document is devoted to each topic. A topic is represented as a weighted list of words.
To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with varying numbers of topics. You can evaluate the goodness-of-fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents; a lower perplexity suggests a better fit.
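As a rough sketch of that comparison, the snippet below fits LDA models for a few values of k on a training split and computes perplexity on held-out rows of the same document-term matrix. The toy documents, the split, and the range of k are made up for illustration; perplexity() is the function from the topicmodels package, and VCorpus() is used here in place of Corpus().

```r
library(tm)
library(topicmodels)

# Hypothetical toy documents standing in for real abstracts.
docs <- c(
  "gene expression in tumor cells",
  "tumor growth and gene mutation",
  "cancer cells and gene therapy",
  "heart disease and blood pressure",
  "blood pressure and heart risk",
  "heart failure and blood flow",
  "gene mutation in cancer cells",
  "blood pressure and heart disease"
)

# Build one matrix so training and held-out rows share the same vocabulary.
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
train <- dtm[1:6, ]
held_out <- dtm[7:8, ]

# Fit one model per candidate k and score it on the held-out documents.
ks <- 2:4
perp <- sapply(ks, function(k) {
  fit <- LDA(train, k = k, control = list(seed = 505))
  perplexity(fit, newdata = held_out)
})
names(perp) <- ks
perp
```

With a corpus this small the numbers mean very little; on real data you would probably want cross-validation rather than a single split.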
LDA is applied to text data. It works by decomposing the corpus document-word matrix (the larger matrix) into two smaller matrices: the document-topic matrix and the topic-word matrix. In that sense LDA, like PCA, can be viewed as a matrix factorization technique.
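To make the factorization view concrete, here is a small base R sketch with made-up numbers (3 documents, 2 topics, 4 words; none of these values come from a fitted model). Multiplying the document-topic matrix by the topic-word matrix gives each document's implied word distribution. In topicmodels, the corresponding matrices of a fitted model are available as posterior(fit)$topics and posterior(fit)$terms.

```r
# Document-topic matrix: each row is one document's topic proportions.
theta <- matrix(c(0.9, 0.1,
                  0.5, 0.5,
                  0.2, 0.8),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("doc", 1:3), paste0("topic", 1:2)))

# Topic-word matrix: each row is one topic's distribution over words.
beta <- matrix(c(0.4, 0.4, 0.1, 0.1,
                 0.1, 0.1, 0.4, 0.4),
               nrow = 2, byrow = TRUE,
               dimnames = list(paste0("topic", 1:2),
                               c("gene", "tumor", "heart", "blood")))

# The product is a documents x words matrix of word probabilities.
doc_word <- theta %*% beta
doc_word
rowSums(doc_word)  # each row is a probability distribution
```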
We can use the coherence score in topic modeling to measure how interpretable the topics are to humans. In this case, topics are represented as the top N words with the highest probability of belonging to that particular topic. Briefly, the coherence score measures how similar these words are to each other.
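There are several coherence measures; as one possible illustration, the base R sketch below implements a UMass-style score, which rewards top words that tend to occur in the same documents. The function name umass_coherence and the toy documents are hypothetical, and this is a simplified sketch rather than a reference implementation.

```r
# Hypothetical documents, each reduced to its set of words.
docs <- list(
  c("gene", "tumor", "cancer"),
  c("gene", "cancer", "cells"),
  c("heart", "blood", "pressure"),
  c("blood", "pressure", "gene")
)

# top_words must be ordered from most to least probable, and every word
# must occur in at least one document (otherwise we would divide by zero).
umass_coherence <- function(top_words, docs) {
  # Number of documents that contain every word in w.
  doc_freq <- function(w) {
    sum(vapply(docs, function(d) all(w %in% d), logical(1)))
  }
  score <- 0
  for (m in 2:length(top_words)) {
    for (l in 1:(m - 1)) {
      co_occur <- doc_freq(top_words[c(m, l)])
      score <- score + log((co_occur + 1) / doc_freq(top_words[l]))
    }
  }
  score
}

coherent <- umass_coherence(c("gene", "cancer", "tumor"), docs)
mixed <- umass_coherence(c("gene", "blood", "tumor"), docs)
coherent > mixed  # words that co-occur more often score higher
```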
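Reading the topicmodels documentation, it does appear that the LDA() function expects a DocumentTermMatrix, not a TermDocumentMatrix. With a TermDocumentMatrix the rows are terms and the columns are documents, so LDA() treats each term as a document and each document ID as a "term", which is why terms() returns numbers. Try creating the former with DocumentTermMatrix(AbstractCorpus) and see if that works.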
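A minimal sketch of that change on toy data follows. The abstracts vector is a made-up stand-in for records$Abstract, and VCorpus() is used in place of Corpus(). Since the question mentions that some records have no abstract, the sketch also drops empty rows, because LDA() stops with an error when a document contains no terms.

```r
library(tm)
library(topicmodels)

# Hypothetical stand-in for records$Abstract; the empty string mimics
# a record without an abstract.
abstracts <- c(
  "gene expression in tumor cells and cancer tissue",
  "cancer tumor growth and gene mutation",
  "",
  "heart disease risk and blood pressure",
  "blood pressure treatment lowers heart risk"
)

corpus <- VCorpus(VectorSource(abstracts))

# Rows are documents and columns are terms, which is what LDA() expects.
dtm <- DocumentTermMatrix(corpus)

# Drop documents that ended up with no terms at all.
dtm <- dtm[slam::row_sums(dtm) > 0, ]

fit <- LDA(dtm, k = 2, control = list(seed = 505))
top_terms <- terms(fit, 3)
top_terms  # words now, not document numbers
```

Note that dropping rows changes the row positions, so if you later want to map topics back to the original records, keep track of which rows were kept.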