I am trying to find code that actually works to find the most frequently used two- and three-word phrases in the R text mining package (maybe there is another package for it that I do not know about). I have been trying to use the tokenizer, but seem to have no luck.
If you have worked on a similar situation in the past, could you post code that is tested and actually works? Thank you so much!
You can pass in a custom tokenizing function to tm's DocumentTermMatrix function, so if you have package tau installed it's fairly straightforward.
library(tm); library(tau)
# Tokenizer returning the n-grams (default n = 3) counted by tau::textcnt()
tokenize_ngrams <- function(x, n = 3) return(rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n)))))
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
corpus <- Corpus(VectorSource(texts))
matrix <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams))
Where n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in package RTextTools, which further simplifies things.
library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts,ngramLength=3)
This returns a class of DocumentTermMatrix for use with package tm.
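The matrix itself only stores counts per document; to get the most frequent phrases overall (my addition, not part of the original answer), a minimal sketch is to sum each term's counts across documents and sort:
# Sketch: rank phrases in `matrix` by total frequency across all documents
phrase_freq <- sort(colSums(as.matrix(matrix)), decreasing = TRUE)
head(phrase_freq, 10)   # top 10 phrases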
This is part 5 of the FAQ of the tm package:
5. Can I use bigrams instead of single tokens in a term-document matrix?
Yes. RWeka provides a tokenizer for arbitrary n-grams which can be directly passed on to the term-document matrix constructor. E.g.:
library("RWeka")
library("tm")
data("crude")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[340:345,1:10])
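To rank those bigrams by how often they occur (a hedged extension of mine, not part of the FAQ text), you can sum across the rows, since terms are rows in a TermDocumentMatrix:
# Ten most frequent bigrams in the crude corpus (assumes tdm from above)
sort(rowSums(as.matrix(tdm)), decreasing = TRUE)[1:10]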
This is my own made-up creation for different purposes, but I think it may be applicable to your needs too:
# User-defined functions
Trim <- function(x) gsub("^\\s+|\\s+$", "", x)                                       # strip leading/trailing whitespace
breaker <- function(x) unlist(strsplit(x, "[[:space:]]|(?=[.!?*-])", perl = TRUE))   # split on spaces and sentence punctuation
strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){                # lower-case and drop punctuation/digits
  strp <- function(x, digit.remove, apostrophe.remove){
    x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
    x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2
    ifelse(digit.remove == TRUE, gsub("[[:digit:]]", "", x2), x2)
  }
  unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove,
                                         apostrophe.remove = apostrophe.remove))))
}
unblanker <- function(x) subset(x, nchar(x) > 0)                                     # drop empty strings
# Fake text data
x <- "I like green eggs and ham. They are delicious. They taste so yummy. I'm talking about ham and eggs of course"
# The code using base R to do what you want
breaker(x)
strip(x)
words <- unblanker(breaker(strip(x)))
textDF <- as.data.frame(table(words))
textDF$characters <- sapply(as.character(textDF$words), nchar)
textDF2 <- textDF[order(-textDF$characters, textDF$Freq), ]
rownames(textDF2) <- 1:nrow(textDF2)
textDF2
subset(textDF2, characters %in% 2:3)   # words that are 2-3 characters long
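The table above counts single words; as a rough extension of the same base-R idea (my addition, and note it will paste words straight across the stripped sentence boundaries), you can form two- and three-word phrases by pasting adjacent entries of words and tabulating those:
# Sketch: adjacent-word phrases built from the `words` vector
bigrams <- paste(head(words, -1), tail(words, -1))
trigrams <- paste(head(words, -2), words[-c(1, length(words))], tail(words, -2))
head(sort(table(bigrams), decreasing = TRUE))
head(sort(table(trigrams), decreasing = TRUE))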
The corpus library has a function called term_stats that does what you want:
library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
## term count support
## 1 of the 336 1
## 2 the scarecrow 208 1
## 3 to the 185 1
## 4 and the 166 1
## 5 said the 152 1
## 6 in the 147 1
## 7 the lion 141 1
## 8 the tin 123 1
## 9 the tin woodman 114 1
## 10 tin woodman 114 1
## 11 i am 84 1
## 12 it was 69 1
## 13 in a 64 1
## 14 the great 63 1
## 15 the wicked 61 1
## 16 wicked witch 60 1
## 17 at the 59 1
## 18 the little 59 1
## 19 the wicked witch 58 1
## 20 back to 57 1
## ⋮ (52511 rows total)
Here, count is the number of appearances, and support is the number of documents containing the term.
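Since term_stats() returns the result as a data frame with those columns, you can subset it further; for example (a small sketch of my own), to keep only the phrases seen at least 50 times:
stats <- term_stats(corpus, ngrams = 2:3)
head(subset(stats, count >= 50), 10)   # frequent two-/three-word phrases only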
I had a similar problem using the tm and ngram packages.
After debugging mclapply, I saw there were problems on documents with fewer than two words, with the following error:
input 'x' has nwords=1 and n=2; must have nwords >= n
So I've added a filter to remove documents with a low word count:
myCorpus.3 <- tm_filter(myCorpus.2, function(x) {
  length(unlist(strsplit(stringr::str_trim(x$content), '[[:blank:]]+'))) > 1
})
Then my tokenize function looks like:
bigramTokenizer <- function(x) {
  x <- as.character(x)

  # Find words
  one.list <- c()
  tryCatch({
    one.gram <- ngram::ngram(x, n = 1)
    one.list <- ngram::get.ngrams(one.gram)
  },
  error = function(cond) { warning(cond) })

  # Find 2-grams
  two.list <- c()
  tryCatch({
    two.gram <- ngram::ngram(x, n = 2)
    two.list <- ngram::get.ngrams(two.gram)
  },
  error = function(cond) { warning(cond) })

  res <- unlist(c(one.list, two.list))
  res[res != '']
}
Then you can test the function with:
dtmTest <- lapply(myCorpus.3, bigramTokenizer)
And finally:
dtm <- DocumentTermMatrix(myCorpus.3, control = list(tokenize = bigramTokenizer))
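As a quick check (my addition, not part of the original workflow), tm's findFreqTerms() will list the tokens and bigrams in that matrix that reach a chosen frequency:
findFreqTerms(dtm, lowfreq = 5)   # terms appearing at least 5 times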
Try the tidytext package:
library(dplyr)
library(tidytext)
library(janeaustenr)
library(tidyr)
Suppose I have a data frame CommentData that contains a Comment column, and I want to find occurrences of two words together. Then try:
bigram_filtered <- CommentData %>%
  unnest_tokens(bigram, Comment, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
The above code creates tokens, and then removes stop words that don't help in the analysis (e.g. the, an, to, etc.). Then you count the occurrences of these word pairs. You will then use the unite function to combine the individual words and record their occurrence.
bigrams_united <- bigram_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigrams_united
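If you want a fully self-contained example to run, here is a sketch of the same pipeline using the janeaustenr data that is already loaded above, substituted for the hypothetical CommentData:
austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%              # blank lines produce NA bigrams
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE) %>%
  unite(bigram, word1, word2, sep = " ")
head(austen_bigrams)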