I have been using the tm package to run some text analysis. My problem is with creating a list of words together with their associated frequencies.
    library(tm)
    library(RWeka)

    txt <- read.csv("HW.csv", header = TRUE)
    df <- do.call("rbind", lapply(txt, as.data.frame))
    names(df) <- "text"

    myCorpus <- Corpus(VectorSource(df$text))
    myStopwords <- c(stopwords('english'), "originally", "posted")
    myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

    # building the TDM
    btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
    myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm))
I typically use the following code to generate a list of the words at or above a given frequency:
frq1 <- findFreqTerms(myTdm, lowfreq=50)
Is there any way to automate this so that we get a data frame with all words and their frequencies?
The other problem that I face is with converting the term-document matrix into a data frame. As I am working on large samples of data, I run into memory errors. Is there a simple solution for this?
You can use sapply() to go over the counts and match every item in counts against the strings column in df using grepl(). For each item this returns a logical vector (TRUE for a match, FALSE for a non-match), which you can sum to get the number of matches.
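A minimal sketch of that idea, using made-up data: the `df$strings` column and the `counts` vector of terms are assumptions for illustration, not the asker's actual objects. Note that this counts how many strings contain each term, not the total number of occurrences.

```r
# Hypothetical sample data: df$strings holds the texts, counts holds the terms to look for
df <- data.frame(strings = c("oil prices rise", "crude oil demand", "opec meets"),
                 stringsAsFactors = FALSE)
counts <- c("oil", "opec", "gas")

# For each term, grepl() flags the strings that contain it; sum() counts the TRUEs.
# fixed = TRUE treats the term as a literal string rather than a regular expression.
matches <- sapply(counts, function(term) sum(grepl(term, df$strings, fixed = TRUE)))

# Collect the result in a data frame, one row per term
freq_df <- data.frame(term = names(matches), n = unname(matches))
print(freq_df)
```

Terms that never appear (here "gas") stay in the result with a count of 0, which may or may not be what you want.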
To create a frequency table in R, we can simply use the table() function, but its output is a horizontal table. To get the frequencies in data frame format, convert the table with the as.data.frame() function.
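As a rough illustration in base R, with a made-up vector of words (the names `words`, `word` and `freq` are just chosen for this example):

```r
# Hypothetical input: one element per word occurrence
words <- c("oil", "opec", "oil", "crude", "oil", "opec")

# table() counts the occurrences of each distinct word
freq_table <- table(words)

# as.data.frame() turns the table into two columns: the word and its count
freq_df <- as.data.frame(freq_table, stringsAsFactors = FALSE)
names(freq_df) <- c("word", "freq")

# Sort from most to least frequent
freq_df <- freq_df[order(-freq_df$freq), ]
print(freq_df)
```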
Try this
    library(tm)

    data("crude")
    myTdm <- as.matrix(TermDocumentMatrix(crude))
    # One row per term: the term itself and its total count across all documents
    FreqMat <- data.frame(ST = rownames(myTdm),
                          Freq = rowSums(myTdm),
                          row.names = NULL)
    head(FreqMat, 10)
    #            ST Freq
    # 1       "(it)    1
    # 2     "demand    1
    # 3  "expansion    1
    # 4        "for    1
    # 5     "growth    1
    # 6         "if    1
    # 7         "is    2
    # 8        "may    1
    # 9       "none    2
    # 10      "opec    2
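On the memory question: with a large corpus, the as.matrix() call is usually what runs out of memory, because it expands the sparse term-document matrix into a dense one. tm stores the matrix as a sparse triplet (fields `i`, `j`, `v`), so the per-term totals can be computed from the non-zero entries alone. The sketch below uses a toy list that only mimics those fields, so that it runs without tm installed; it is a stand-in, not a real TermDocumentMatrix.

```r
# Toy stand-in for a TermDocumentMatrix (assumption for illustration):
# i = term index, j = document index, v = count of that term in that document
tdm <- list(
  i = c(1L, 1L, 2L, 3L, 3L, 3L),
  j = c(1L, 2L, 1L, 1L, 2L, 3L),
  v = c(2, 1, 4, 1, 1, 1),
  nrow = 3L, ncol = 3L,
  dimnames = list(Terms = c("crude", "oil", "opec"), Docs = c("d1", "d2", "d3"))
)

# Sum the non-zero counts per term index; no dense matrix is created
sums <- tapply(tdm$v, tdm$i, sum)
totals <- numeric(tdm$nrow)
totals[as.integer(names(sums))] <- as.vector(sums)

FreqMat <- data.frame(ST = tdm$dimnames$Terms, Freq = totals)
print(FreqMat)

# With a real TermDocumentMatrix, slam::row_sums(myTdm) should give the same
# per-term totals directly on the sparse object.
```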
The following lines of R create word frequencies and put them in a table. They read a text file in .txt format, count how often each word occurs, and write the result out as tab-separated text. I hope this helps anyone interested.
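    # Read the text file, one element per line
    avisos <- scan("anuncio.txt", what = "character", sep = "\n")
    avisos1 <- tolower(avisos)
    # Split on runs of non-word characters and drop any empty strings left behind
    avisos2 <- strsplit(avisos1, "\\W+")
    avisos3 <- unlist(avisos2)
    avisos3 <- avisos3[avisos3 != ""]
    # Count each word and sort from most to least frequent
    freq <- table(avisos3)
    freq1 <- sort(freq, decreasing = TRUE)
    # "\t" (not "\\t") puts a real tab between the word and its count
    temple.sorted.table <- paste(names(freq1), freq1, sep = "\t")
    # Output goes to a separate file (name assumed here) so the input is not overwritten
    cat("Word\tFREQ", temple.sorted.table, file = "anuncio_freq.txt", sep = "\n")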