I'm trying to do some topic modelling, but I want to use phrases where they exist rather than single words, i.e.:
library(topicmodels)
library(tm)
my.docs = c('the sky is blue, hot sun', 'flowers,hot sun', 'black cats, bees, rats and mice')
my.corpus = Corpus(VectorSource(my.docs))
my.dtm = DocumentTermMatrix(my.corpus)
inspect(my.dtm)
When I inspect my dtm it splits all the words up, but I want the phrases kept together, i.e. there should be a column for each of: "the sky is blue", "hot sun", "flowers", "black cats", "bees", "rats and mice".
How do I make the Document Term Matrix recognise both phrases and single words? They are comma separated.
The solution needs to be efficient, as I want to run it over a lot of data.
You might try an approach using a custom tokenizer. You define the multi-word terms you want treated as phrases (I am not aware of an algorithmic way to do that step automatically):
tokenizing.phrases <- c("sky is blue", "hot sun", "black cats")
Note that no stemming is done, so if you want both "black cats" and "black cat", then you will need to enter both variations. Case is ignored.
Then you need to create a function:
phraseTokenizer <- function(x) {
  require(stringr)

  x <- as.character(x)  # extract the plain text from the tm TextDocument object
  x <- str_trim(x)
  if (is.na(x)) return("")
  # warning(paste("doing:", x))

  # which of the predefined phrases occur in this text? (case-insensitive;
  # regex(..., ignore_case = TRUE) replaces stringr's deprecated ignore.case())
  phrase.hits <- str_detect(x, regex(tokenizing.phrases, ignore_case = TRUE))

  if (any(phrase.hits)) {
    # only split once, on the first hit, so you don't have to worry about
    # multiple occurrences of the same phrase
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    # warning(paste("split phrase:", split.phrase))
    temp <- unlist(str_split(x, regex(split.phrase, ignore_case = TRUE), 2))
    # recurse on the text before and after the phrase, keeping the phrase as one token
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
  } else {
    out <- MC_tokenizer(x)  # no phrases left: fall back to tm's word tokenizer
  }

  out[out != ""]
}
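For example, applied to the first document from the question, it should return something like the following (the exact handling of stray punctuation depends on MC_tokenizer):
tokenizing.phrases <- c("sky is blue", "hot sun", "black cats")
phraseTokenizer("the sky is blue, hot sun")
# [1] "the"         "sky is blue" "hot sun"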
Then you proceed as normal to create a term document matrix, but this time you pass the custom tokenizer via the control argument so the phrases end up as single terms.
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))
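Tying it back to the corpus from the question, a minimal end-to-end sketch might look like the following. Note that k = 2 is arbitrary for this toy data, and with newer versions of tm you may need VCorpus() rather than Corpus(), since the default SimpleCorpus can ignore custom tokenizer functions:
library(tm)
library(stringr)
library(topicmodels)

my.docs <- c('the sky is blue, hot sun', 'flowers,hot sun', 'black cats, bees, rats and mice')
my.corpus <- VCorpus(VectorSource(my.docs))  # VCorpus so the custom tokenizer is honoured

my.dtm <- DocumentTermMatrix(my.corpus, control = list(tokenize = phraseTokenizer))
inspect(my.dtm)  # columns should now include "sky is blue", "hot sun", "black cats"

my.lda <- LDA(my.dtm, k = 2)  # topic model on the phrase-aware matrix; k = 2 is arbitrary here
terms(my.lda, 5)              # top 5 terms per topic, phrases included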