I have some simple R code that reads text from a file and plots recurring phrases on a bar chart. For some reason, the bar chart only shows single words rather than multi-word phrases. Where am I going wrong?
install.packages("xlsx")
install.packages("tm")
install.packages("wordcloud")
install.packages("ggplot2")
library(xlsx)
library(tm)
library(wordcloud)
library(ggplot2)
setwd("C://Users//608447283//desktop//R_word_charts")
test <- Corpus(DirSource("C://Users//608447283//desktop//R_word_charts//source"))
test <- tm_map(test, stripWhitespace)
test <- tm_map(test, tolower)
test <- tm_map(test, removeWords,stopwords("english"))
test <- tm_map(test, removePunctuation)
test <- tm_map(test, PlainTextDocument)
tok <- function(x) NGramTokenizer(x, Weka_control(min=3, max=10))
tdm <- TermDocumentMatrix(test,control = list(tokenize = tok))
termFreq <- rowSums(as.matrix(tdm))
termFreq <- subset(termFreq, termFreq>=50)
write.csv(termFreq,file="TestCSV1")
TestCSV <- read.csv("C:/Users/608447283/Desktop/R_word_charts/TestCSV1")
ggplot(data=TestCSV, aes(x=X, y=x)) +
geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
My output:
Sample data: Sample data extract
There seems to be an issue with the latest version of the tm package (version 0.7).
Going back to version 0.6-2 should solve the 1-gram issue.
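As a rough sketch of how that downgrade could be done: the snippet below checks the installed tm version and, if it is 0.7 or later, installs 0.6-2 from the CRAN archive. It assumes the remotes package is already installed and that the archived version still builds on your R installation; neither is stated in the question.

```r
# Show which version of tm is currently installed
print(packageVersion("tm"))

# Assumption: the 'remotes' package is available for installing archived versions
if (packageVersion("tm") >= "0.7") {
  remotes::install_version("tm", version = "0.6-2")
}
```

Restart the R session afterwards so the older version is the one that gets loaded.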
Another issue might be your data subset.
The filter (termFreq <- subset(termFreq, termFreq>=50)) was too restrictive and filtered out lots of valuable N-grams. I'd rather use a top-N approach to visualize the data, i.e.:
library(tm)
library(ggplot2)
library(RWeka)       # provides NGramTokenizer and Weka_control
library(data.table)  # provides setDT and setnames
library(dplyr)       # provides arrange and desc
setwd(dir = "/home/eliasah/Downloads/")
# Build the corpus from every file in the sample directory
test <- Corpus(DirSource("/home/eliasah/Downloads/sample/"))
# Clean the text: whitespace, case, stop words, punctuation
test <- tm_map(test, stripWhitespace)
test <- tm_map(test, tolower)
test <- tm_map(test, removeWords,stopwords("english"))
test <- tm_map(test, removePunctuation)
test <- tm_map(test, PlainTextDocument)
# Tokenize into N-grams of 3 to 10 words
tok <- function(x) NGramTokenizer(x, Weka_control(min=3, max=10))
tdm <- TermDocumentMatrix(test,control = list(tokenize = tok))
# Total frequency of each N-gram across all documents
termFreq <- rowSums(as.matrix(tdm))
# Turn the named frequency vector into a two-column table (term, freq)
termFreqVector <- as.list(termFreq)
test2 <- data.frame(unlist(termFreqVector), stringsAsFactors=FALSE)
setDT(test2, keep.rownames = TRUE)[]
setnames(test2, 1, "term")
setnames(test2, 2, "freq")
# Keep only the 30 most frequent N-grams
test3 <- head(arrange(test2,desc(freq)), n = 30)
# Horizontal bar chart, ordered by frequency
ggplot(data=test3, aes(x=reorder(term, freq), y=freq)) + geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_flip()
I hope this helps you solve your issue.
NB: I have used the data sample that you have linked in the question.
The bug is still there! But after asking the guys from the 'tm' package, I used "VCorpus" instead of "Corpus" and now it is working.
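A minimal sketch of that VCorpus change, under some assumptions: the two strings in docs are made-up stand-ins for the files in the source directory, and the trigram tokenizer is built on ngrams() and words() from the NLP package (attached together with tm) rather than on RWeka's NGramTokenizer, so that the example runs without Java. It is an illustration of the idea, not the exact code used in the comment.

```r
library(tm)  # also attaches NLP, which provides words() and ngrams()

# Hypothetical stand-in documents (not the data from the question)
docs <- c("the quick brown fox jumps over the lazy dog",
          "the quick brown fox sleeps")

# VCorpus instead of Corpus, as suggested in the comment above
corp <- VCorpus(VectorSource(docs))

# Trigram tokenizer: split into words, build 3-grams, join each with spaces
trigram_tok <- function(x) {
  unlist(lapply(ngrams(words(x), 3L), paste, collapse = " "),
         use.names = FALSE)
}

tdm <- TermDocumentMatrix(corp, control = list(tokenize = trigram_tok))

# Total frequency of each trigram across both documents, most frequent first
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
print(head(term_freq, 5))
```

With the original RWeka tokenizer, the same pattern should apply: keep tok as it is and swap only Corpus(DirSource(...)) for VCorpus(DirSource(...)).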