 

Topic modelling in R using phrases rather than single words

I'm trying to do some topic modelling, but I want to use phrases where they exist rather than single words, e.g.:

library(topicmodels)
library(tm)
my.docs = c('the sky is blue, hot sun', 'flowers,hot sun', 'black cats, bees, rats and mice')
my.corpus = Corpus(VectorSource(my.docs))
my.dtm = DocumentTermMatrix(my.corpus)
inspect(my.dtm)

When I inspect my DTM it splits everything into single words, but I want the phrases kept together, i.e. there should be a column for each of: "the sky is blue", "hot sun", "flowers", "black cats", "bees", "rats and mice".

How do I make the DocumentTermMatrix recognise both phrases and single words? They are comma separated.

The solution needs to be efficient, as I want to run it over a lot of data.
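For illustration, splitting each document on the commas by hand gives the tokens I'd like the DTM to use (a rough sketch with base R's strsplit() and trimws()):

desired.tokens <- lapply(my.docs, function(d) trimws(unlist(strsplit(d, ","))))
desired.tokens
# [[1]] "the sky is blue" "hot sun"
# [[2]] "flowers"         "hot sun"
# [[3]] "black cats"      "bees"            "rats and mice"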

asked Feb 02 '15 by shecode
1 Answer

You might try an approach using a custom tokenizer. You define the multiple-word terms you want treated as phrases (I am not aware of an algorithmic way to do that step):

tokenizing.phrases <- c("sky is blue", "hot sun", "black cats")

Note that no stemming is done, so if you want both "black cats" and "black cat", then you will need to enter both variations. Case is ignored.

Then you need to create a function:

phraseTokenizer <- function(x) {
  require(stringr)

  x <- as.character(x) # extract the plain text from the tm TextDocument object
  x <- str_trim(x)
  if (is.na(x)) return("")
  # warning(paste("doing:", x))
  phrase.hits <- str_detect(x, ignore.case(tokenizing.phrases))

  if (any(phrase.hits)) {
    # only split once, on the first hit, so you don't have to worry about
    # multiple occurrences of the same phrase
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    # warning(paste("split phrase:", split.phrase))
    temp <- unlist(str_split(x, ignore.case(split.phrase), 2))
    # recurse on the text before and after the matched phrase
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
  } else {
    # no phrase found: fall back to tm's MC_tokenizer() for single-word splitting
    out <- MC_tokenizer(x)
  }

  out[out != ""]
}
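As a quick sanity check (assuming tm is loaded so MC_tokenizer() is available, and that the tokenizing.phrases vector above is defined), running the tokenizer on the first example document should give something like this. Note that newer stringr versions deprecate ignore.case() in favour of regex(..., ignore_case = TRUE):

phraseTokenizer("the sky is blue, hot sun")
# [1] "the"         "sky is blue" "hot sun"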

Then you proceed as normal to create a term-document matrix, but this time you pass the phrase-aware tokenizer in via the control argument:

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer)) 
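Putting it together with the documents from the question might look like the sketch below. VCorpus() is used rather than Corpus() because newer tm versions may otherwise build a SimpleCorpus, which ignores custom tokenizers:

library(tm)
my.docs   <- c('the sky is blue, hot sun', 'flowers,hot sun', 'black cats, bees, rats and mice')
my.corpus <- VCorpus(VectorSource(my.docs))
tdm <- TermDocumentMatrix(my.corpus, control = list(tokenize = phraseTokenizer))
inspect(tdm)  # phrases such as "sky is blue" and "hot sun" should appear as single terms

A DocumentTermMatrix built the same way can then be passed to topicmodels::LDA() as usual.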
answered Sep 25 '22 by lawyeR