 

Finding 2 & 3 word Phrases Using R TM Package

I am trying to find code that actually works to find the most frequently used two- and three-word phrases with the R text mining (tm) package (maybe there is another package for this that I do not know about). I have been trying to use the tokenizer, but seem to have no luck.

If you have worked on a similar problem in the past, could you post code that is tested and actually works? Thank you so much!

asked Jan 17 '12 by appletree




6 Answers

You can pass in a custom tokenizing function to tm's DocumentTermMatrix function, so if you have package tau installed it's fairly straightforward.

library(tm); library(tau);

# word n-gram tokenizer built on tau::textcnt
tokenize_ngrams <- function(x, n = 3)
  rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n))))

texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
corpus <- Corpus(VectorSource(texts))
matrix <- DocumentTermMatrix(corpus,control=list(tokenize=tokenize_ngrams))

Here, n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in the RTextTools package, which further simplifies things.

library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts,ngramLength=3)

This returns an object of class DocumentTermMatrix for use with the tm package.
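If the goal is to see the most frequent phrases themselves, here is a minimal sketch that works on either DocumentTermMatrix above (it assumes the matrix is small enough to densify; for a large corpus you would use slam::col_sums on the sparse matrix instead):

freq <- sort(colSums(as.matrix(matrix)), decreasing = TRUE)  # total frequency of each n-gram
head(freq, 10)  # the ten most frequent phrases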

answered by Timothy P. Jurka


This is part 5 of the FAQ of the tm package:

5. Can I use bigrams instead of single tokens in a term-document matrix?

Yes. RWeka provides a tokenizer for arbitrary n-grams which can be directly passed on to the term-document matrix constructor. E.g.:

  library("RWeka")
  library("tm")

  data("crude")

  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))

  inspect(tdm[340:345,1:10])
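To then list the most frequent bigrams from the resulting term-document matrix, a minimal sketch (the frequency cutoff of 5 is arbitrary):

  findFreqTerms(tdm, lowfreq = 5)  # bigrams appearing at least five times
  head(sort(slam::row_sums(tdm), decreasing = TRUE), 10)  # the ten most frequent bigrams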
answered by Ben


This is my own made-up creation for a different purpose, but I think it may be applicable to your needs too:

#User Defined Functions
# Trim: remove leading and trailing whitespace
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)

# breaker: split text on whitespace and on sentence punctuation
breaker <- function(x) unlist(strsplit(x, "[[:space:]]|(?=[.!?*-])", perl=TRUE))

# strip: lower-case text and remove punctuation (optionally digits and apostrophes)
strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){
    strp <- function(x, digit.remove, apostrophe.remove){
        x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
        x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2
        ifelse(digit.remove==TRUE, gsub("[[:digit:]]", "", x2), x2)
    }
    unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove, 
        apostrophe.remove = apostrophe.remove))))
}

# unblanker: drop empty strings
unblanker <- function(x) subset(x, nchar(x) > 0)

#Fake Text Data
x <- "I like green eggs and ham.  They are delicious.  They taste so yummy.  I'm talking about ham and eggs of course"

#The code using Base R to Do what you want
breaker(x)
strip(x)
words <- unblanker(breaker(strip(x)))
textDF <- as.data.frame(table(words))
textDF$characters <- sapply(as.character(textDF$words), nchar)
textDF2 <- textDF[order(-textDF$characters, textDF$Freq), ]
rownames(textDF2) <- 1:nrow(textDF2)
textDF2
subset(textDF2, characters%in%2:3)
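Note that this tabulates single words (filtered by character length); to get actual two- and three-word phrases from the same words vector, a minimal base-R sketch is to paste adjacent words together before tabulating:

bigrams <- paste(head(words, -1), tail(words, -1))  # adjacent word pairs
trigrams <- paste(head(words, -2), words[-c(1, length(words))], tail(words, -2))  # adjacent word triples
head(sort(table(bigrams), decreasing = TRUE))
head(sort(table(trigrams), decreasing = TRUE))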
answered by Tyler Rinker


The corpus library has a function called term_stats that does what you want:

library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
##    term             count support
## 1  of the             336       1
## 2  the scarecrow      208       1
## 3  to the             185       1
## 4  and the            166       1
## 5  said the           152       1
## 6  in the             147       1
## 7  the lion           141       1
## 8  the tin            123       1
## 9  the tin woodman    114       1
## 10 tin woodman        114       1
## 11 i am                84       1
## 12 it was              69       1
## 13 in a                64       1
## 14 the great           63       1
## 15 the wicked          61       1
## 16 wicked witch        60       1
## 17 at the              59       1
## 18 the little          59       1
## 19 the wicked witch    58       1
## 20 back to             57       1
## ⋮  (52511 rows total)

Here, count is the number of appearances, and support is the number of documents containing the term.
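term_stats returns an ordinary data frame, so separating the two-word from the three-word phrases is just post-processing of that result; a sketch, assuming it is stored in stats:

stats <- term_stats(corpus, ngrams = 2:3)
nwords <- lengths(strsplit(as.character(stats$term), " ", fixed = TRUE))  # words per term
head(stats[nwords == 3, ], 10)  # the ten most frequent three-word phrases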

answered by Patrick Perry


I had a similar problem using the tm and ngram packages. After debugging mclapply, I saw that documents with fewer than two words failed with the following error:

   input 'x' has nwords=1 and n=2; must have nwords >= n

So I added a filter to remove documents with a low word count:

    myCorpus.3 <- tm_filter(myCorpus.2, function (x) {
      length(unlist(strsplit(stringr::str_trim(x$content), '[[:blank:]]+'))) > 1
    })

Then my tokenizer function looks like this:

bigramTokenizer <- function(x) {
  x <- as.character(x)

  # Find words
  one.list <- c()
  tryCatch({
    one.gram <- ngram::ngram(x, n = 1)
    one.list <- ngram::get.ngrams(one.gram)
  }, 
  error = function(cond) { warning(cond) })

  # Find 2-grams
  two.list <- c()
  tryCatch({
    two.gram <- ngram::ngram(x, n = 2)
    two.list <- ngram::get.ngrams(two.gram)
  },
  error = function(cond) { warning(cond) })

  res <- unlist(c(one.list, two.list))
  res[res != '']
}

Then you can test the function with:

dtmTest <- lapply(myCorpus.3, bigramTokenizer)

And finally:

dtm <- DocumentTermMatrix(myCorpus.3, control = list(tokenize = bigramTokenizer))
answered by Géraud


Try the tidytext package

library(dplyr)
library(tidytext)
library(janeaustenr)
library(tidyr)

Suppose I have a data frame CommentData that contains a Comment column, and I want to find occurrences of two words together. Then try:

bigram_filtered <- CommentData %>%
  unnest_tokens(bigram, Comment, token= "ngrams", n=2) %>%
  separate(bigram, c("word1","word2"), sep=" ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort=TRUE)

The above code creates tokens and then removes stop words that don't help in the analysis (e.g. the, an, to, etc.). Then you count the occurrences of these word pairs. You then use the unite function to combine the individual words and record their occurrence.

bigrams_united <- bigram_filtered %>%
  unite(bigram, word1, word2, sep=" ")
bigrams_united
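The same pattern extends to three-word phrases; a sketch, again assuming the hypothetical CommentData data frame with its Comment column:

trigrams_united <- CommentData %>%
  unnest_tokens(trigram, Comment, token = "ngrams", n = 3) %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word,
         !word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE) %>%
  unite(trigram, word1, word2, word3, sep = " ")
trigrams_united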
answered by Monika Singh