I am trying to find code that actually works to find the most frequently used two- and three-word phrases in the R text mining package (maybe there is another package for it that I do not know about). I have been trying to use the tokenizer, but seem to have no luck.
If you have worked on a similar situation in the past, could you post code that is tested and actually works? Thank you so much!
You can pass in a custom tokenizing function to tm's DocumentTermMatrix function, so if you have package tau installed it's fairly straightforward.
library(tm); library(tau)
# Tokenizer returning the n-grams (default n = 3) counted by tau::textcnt()
tokenize_ngrams <- function(x, n = 3) return(rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n)))))
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
corpus <- Corpus(VectorSource(texts))
matrix <- DocumentTermMatrix(corpus, control = list(tokenize = tokenize_ngrams))
Where n in the tokenize_ngrams function is the number of words per phrase. This feature is also implemented in package RTextTools, which further simplifies things.
library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts,ngramLength=3)
This returns a class of DocumentTermMatrix for use with package tm.
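The matrix itself only stores counts per document; to get the most frequent phrases overall (my addition, not part of the original answer), a minimal sketch is to sum each term's counts across documents and sort:
# Sketch: rank phrases in `matrix` by total frequency across all documents
phrase_freq <- sort(colSums(as.matrix(matrix)), decreasing = TRUE)
head(phrase_freq, 10)   # top 10 phrases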
This is part 5 of the FAQ of the tm package:
5. Can I use bigrams instead of single tokens in a term-document matrix?
Yes. RWeka provides a tokenizer for arbitrary n-grams which can be directly passed on to the term-document matrix constructor. E.g.:
library("RWeka")
library("tm")
data("crude")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[340:345,1:10])
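To rank those bigrams by how often they occur (a hedged extension of mine, not part of the FAQ text), you can sum across the rows, since terms are rows in a TermDocumentMatrix:
# Ten most frequent bigrams in the crude corpus (assumes tdm from above)
sort(rowSums(as.matrix(tdm)), decreasing = TRUE)[1:10]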
This is my own made-up creation for different purposes, but I think it may be applicable to your needs too:
# User-defined functions
Trim <- function(x) gsub("^\\s+|\\s+$", "", x)                                       # strip leading/trailing whitespace
breaker <- function(x) unlist(strsplit(x, "[[:space:]]|(?=[.!?*-])", perl = TRUE))   # split on spaces and sentence punctuation
strip <- function(x, digit.remove = TRUE, apostrophe.remove = FALSE){                # lower-case and drop punctuation/digits
  strp <- function(x, digit.remove, apostrophe.remove){
    x2 <- Trim(tolower(gsub(".*?($|'|[^[:punct:]]).*?", "\\1", as.character(x))))
    x2 <- if(apostrophe.remove) gsub("'", "", x2) else x2
    ifelse(digit.remove == TRUE, gsub("[[:digit:]]", "", x2), x2)
  }
  unlist(lapply(x, function(x) Trim(strp(x = x, digit.remove = digit.remove,
                                         apostrophe.remove = apostrophe.remove))))
}
unblanker <- function(x) subset(x, nchar(x) > 0)                                     # drop empty strings
# Fake text data
x <- "I like green eggs and ham. They are delicious. They taste so yummy. I'm talking about ham and eggs of course"
# The code using base R to do what you want
breaker(x)
strip(x)
words <- unblanker(breaker(strip(x)))
textDF <- as.data.frame(table(words))
textDF$characters <- sapply(as.character(textDF$words), nchar)
textDF2 <- textDF[order(-textDF$characters, textDF$Freq), ]
rownames(textDF2) <- 1:nrow(textDF2)
textDF2
subset(textDF2, characters %in% 2:3)   # words that are 2-3 characters long
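The table above counts single words; as a rough extension of the same base-R idea (my addition, and note it will paste words straight across the stripped sentence boundaries), you can form two- and three-word phrases by pasting adjacent entries of words and tabulating those:
# Sketch: adjacent-word phrases built from the `words` vector
bigrams <- paste(head(words, -1), tail(words, -1))
trigrams <- paste(head(words, -2), words[-c(1, length(words))], tail(words, -2))
head(sort(table(bigrams), decreasing = TRUE))
head(sort(table(trigrams), decreasing = TRUE))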
The corpus library has a function called term_stats that does what you want:
library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
## term count support
## 1 of the 336 1
## 2 the scarecrow 208 1
## 3 to the 185 1
## 4 and the 166 1
## 5 said the 152 1
## 6 in the 147 1
## 7 the lion 141 1
## 8 the tin 123 1
## 9 the tin woodman 114 1
## 10 tin woodman 114 1
## 11 i am 84 1
## 12 it was 69 1
## 13 in a 64 1
## 14 the great 63 1
## 15 the wicked 61 1
## 16 wicked witch 60 1
## 17 at the 59 1
## 18 the little 59 1
## 19 the wicked witch 58 1
## 20 back to 57 1
## ⋮ (52511 rows total)
Here, count is the number of appearances, and support is the number of documents containing the term.
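Since term_stats() returns the result as a data frame with those columns, you can subset it further; for example (a small sketch of my own), to keep only the phrases seen at least 50 times:
stats <- term_stats(corpus, ngrams = 2:3)
head(subset(stats, count >= 50), 10)   # frequent two-/three-word phrases only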
I had a similar problem using the tm and ngram packages.
After debugging mclapply, I saw there were problems on documents with fewer than two words, with the following error:
input 'x' has nwords=1 and n=2; must have nwords >= n
So I've added a filter to remove documents with a low word count:
myCorpus.3 <- tm_filter(myCorpus.2, function(x) {
  length(unlist(strsplit(stringr::str_trim(x$content), '[[:blank:]]+'))) > 1
})
Then my tokenize function looks like:
bigramTokenizer <- function(x) {
  x <- as.character(x)

  # Find words
  one.list <- c()
  tryCatch({
    one.gram <- ngram::ngram(x, n = 1)
    one.list <- ngram::get.ngrams(one.gram)
  },
  error = function(cond) { warning(cond) })

  # Find 2-grams
  two.list <- c()
  tryCatch({
    two.gram <- ngram::ngram(x, n = 2)
    two.list <- ngram::get.ngrams(two.gram)
  },
  error = function(cond) { warning(cond) })

  res <- unlist(c(one.list, two.list))
  res[res != '']
}
Then you can test the function with:
dtmTest <- lapply(myCorpus.3, bigramTokenizer)
And finally:
dtm <- DocumentTermMatrix(myCorpus.3, control = list(tokenize = bigramTokenizer))
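As a quick check (my addition, not part of the original workflow), tm's findFreqTerms() will list the tokens and bigrams in that matrix that reach a chosen frequency:
findFreqTerms(dtm, lowfreq = 5)   # terms appearing at least 5 times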
Try the tidytext package:
library(dplyr)
library(tidytext)
library(janeaustenr)
library(tidyr)
Suppose I have a data frame CommentData that contains a Comment column, and I want to find occurrences of two words together. Then try:
bigram_filtered <- CommentData %>%
  unnest_tokens(bigram, Comment, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
The above code creates tokens, and then removes stop words that don't help in the analysis (e.g. the, an, to, etc.). Then you count the occurrences of these word pairs. You will then use the unite function to combine the individual words and record their occurrence.
bigrams_united <- bigram_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigrams_united
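If you want a fully self-contained example to run, here is a sketch of the same pipeline using the janeaustenr data that is already loaded above, substituted for the hypothetical CommentData:
austen_bigrams <- austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%              # blank lines produce NA bigrams
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE) %>%
  unite(bigram, word1, word2, sep = " ")
head(austen_bigrams)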