LDA TopicModels producing list of numbers rather than terms

Tags: r, lda, topicmodels

Bear with me as I am extremely new to this and working on a project for a course in a certificate program.

I have a .csv dataset that I obtained by retrieving bibliometric records from the PubMed and Embase databases. There are 1034 rows and several columns, but I am trying to create topic models from just one column, Abstract (some records do not have an abstract). I've done some preprocessing (removing stopwords, punctuation, etc.) and have been able to barplot words occurring more than 200 times, create a frequent-term list by rank, and run word associations with selected words. So R does seem to see the words themselves in the Abstract field. My issue comes when I try to create topic models with the topicmodels package.
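The preprocessing was along these lines (a rough sketch rather than my exact code; it assumes the tm package):

library(tm)
AbstractCorpus <- Corpus(VectorSource(records$Abstract))
AbstractCorpus <- tm_map(AbstractCorpus, content_transformer(tolower))
AbstractCorpus <- tm_map(AbstractCorpus, removePunctuation)
AbstractCorpus <- tm_map(AbstractCorpus, removeWords, stopwords("english"))
AbstractCorpus <- tm_map(AbstractCorpus, stripWhitespace)

Here's the bit of code I'm using for the topic models themselves.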

# including 1st 3 lines for reference
options(header = FALSE, stringsAsFactors = FALSE, FileEncoding = "latin1")
records <- read.csv("Combined.csv")
library(tm)
AbstractCorpus <- Corpus(VectorSource(records$Abstract))

AbstractTDM <- TermDocumentMatrix(AbstractCorpus)
library(topicmodels)
library(lda)
lda <- LDA(AbstractTDM, k = 8)
(term <- terms(lda, 6))
term <- apply(term, MARGIN = 2, paste, collapse = ",")

However, this is the output I get:

     Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
[1,] "499"   "733"   "390"   "833"   "17"    "413"   "719"   "392"
[2,] "484"   "655"   "808"   "412"   "550"   "881"   "721"   "61"
[3,] "857"   "299"   "878"   "909"   "15"    "258"   "47"    "164"
[4,] "491"   "672"   "313"   "1028"  "126"   "55"    "375"   "987"
[5,] "734"   "430"   "405"   "102"   "13"    "193"   "83"    "588"
[6,] "403"   "52"    "489"   "10"    "598"   "52"    "933"   "980"

Why am I seeing numbers here rather than words?

Furthermore, the following code, which I basically took from the topicmodels package's PDF documentation, does produce values for me, but the topics are still numbers rather than words, which is meaningless to me.

# using information from the topicmodels paper
library(tm)
library(topicmodels)
library(lda)
AbstractTM <- list(
  VEM       = LDA(AbstractTDM, k = 10, control = list(seed = 505)),
  VEM_fixed = LDA(AbstractTDM, k = 10,
                  control = list(estimate.alpha = FALSE, seed = 505)),
  Gibbs     = LDA(AbstractTDM, k = 10, method = "Gibbs",
                  control = list(seed = 505, burnin = 100, thin = 10, iter = 100)),
  CTM       = CTM(AbstractTDM, k = 10,
                  control = list(seed = 505, var = list(tol = 10^-4),
                                 em = list(tol = 10^-3)))
)
# To compare the fitted models, we first investigate the alpha values of the
# models fitted with VEM and alpha estimated, and with VEM and alpha fixed

sapply(AbstractTM[1:2], slot, "alpha")

# Find entropy
sapply(AbstractTM, function(x)
  mean(apply(posterior(x)$topics, 1, function(z) -sum(z * log(z)))))

#Find estimated topics and terms
Topic <- topics(AbstractTM[["VEM"]], 1)
Topic
#find 5 most frequent terms for each topic
Terms <- terms(AbstractTM[["VEM"]], 5)
Terms[,1:5]

Any thoughts on what the issue might be?

asked Apr 17 '17 by SciLibby

People also ask

What is the output of LDA?

LDA (short for Latent Dirichlet Allocation) is an unsupervised machine-learning model that takes documents as input and finds topics as output. The model also says in what percentage each document talks about each topic. A topic is represented as a weighted list of words.
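In the topicmodels package used in the question, for example, both pieces of output can be read off a fitted model with posterior() (a sketch; model stands for any fitted LDA object):

post <- posterior(model)
post$terms[, 1:5]    # topic-word matrix: probability of each word under each topic
post$topics[1:5, ]   # document-topic matrix: topic proportions per document
terms(model, 6)      # top 6 words per topic, as a character matrix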

How do you determine the number of topics in LDA?

To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with varying numbers of topics. You can evaluate the goodness-of-fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents.
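With topicmodels this could be sketched as follows (the train/holdout split and the k values are illustrative, and AbstractDTM stands for a DocumentTermMatrix):

# fit models for several k on a training set, score on held-out documents
train   <- AbstractDTM[1:800, ]
holdout <- AbstractDTM[801:1034, ]
for (k in c(5, 10, 15, 20)) {
  fit <- LDA(train, k = k, control = list(seed = 505))
  cat("k =", k, "perplexity =", perplexity(fit, holdout), "\n")
}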

How LDA in topic modeling represents the documents and words of the text?

LDA is applied to the text data. It works by decomposing the corpus document-word matrix (the larger matrix) into two smaller matrices: the Document-Topic matrix and the Topic-Word matrix. Therefore LDA, like PCA, is a matrix factorization technique.

How do you interpret a coherence score?

We can use the coherence score in topic modeling to measure how interpretable the topics are to humans. In this case, topics are represented as the top N words with the highest probability of belonging to that particular topic. Briefly, the coherence score measures how similar these words are to each other.
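The UMass variant, for instance, can be computed directly from a document-term matrix; a minimal sketch (umass_coherence, dtm, and top_words are illustrative names):

# UMass coherence for one topic: sum of log((co-occurrences + 1) / occurrences)
# over ordered pairs of the topic's top-ranked words
umass_coherence <- function(top_words, dtm) {
  m <- as.matrix(dtm[, top_words]) > 0   # which documents contain each word
  score <- 0
  for (i in 2:length(top_words)) {
    for (j in 1:(i - 1)) {
      co <- sum(m[, i] & m[, j])         # documents containing both words
      score <- score + log((co + 1) / sum(m[, j]))
    }
  }
  score
}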


1 Answer

Reading the topicmodels documentation, it does appear that the LDA() function expects a DocumentTermMatrix, not a TermDocumentMatrix. With a TermDocumentMatrix the rows and columns are swapped relative to what LDA() expects, so it treats each term as a "document" and each of your 1034 documents as a "term"; the "most frequent terms" it reports are therefore document indices, which is why you see numbers like 1028 instead of words. Try creating the former with DocumentTermMatrix(AbstractCorpus) and see if that works.
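Something along these lines (a sketch; AbstractDTM and lda_fit are just illustrative names):

library(tm)
library(topicmodels)
# documents as rows, terms as columns -- what LDA() expects
AbstractDTM <- DocumentTermMatrix(AbstractCorpus)
# records with no abstract become all-zero rows, which LDA() rejects;
# drop them first (slam is the sparse-matrix package tm builds on)
AbstractDTM <- AbstractDTM[slam::row_sums(AbstractDTM) > 0, ]
lda_fit <- LDA(AbstractDTM, k = 8)
terms(lda_fit, 6)   # top terms should now be words, not document indices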

answered Sep 22 '22 by Kara Woo