I am trying to remove some words from a corpus I have built but it doesn't seem to be working. I first run through everything and create a dataframe that lists my words in order of their frequency. I use this list to identify words I am not interested in and then try to create a new list with the words removed. However, the words remain in my dataset. I am wondering what I am doing wrong and why the words aren't being removed? I have included the full code below:
install.packages("rvest")
install.packages("tm")
install.packages("SnowballC")
install.packages("stringr")
library(stringr)
library(tm)
library(SnowballC)
library(rvest)
# Pull in the data I have been using.
paperList <- html("http://journals.plos.org/plosone/search?q=nutrigenomics&sortOrder=RELEVANCE&filterJournals=PLoSONE&resultsPerPage=192")
paperURLs <- paperList %>%
html_nodes(xpath="//*[@class='search-results-title']/a") %>%
html_attr("href")
paperURLs <- paste("http://journals.plos.org", paperURLs, sep = "")
paper_html <- sapply(1:length(paperURLs), function(x) html(paperURLs[x]))
paperText <- sapply(1:length(paper_html), function(x) paper_html[[1]] %>%
html_nodes(xpath="//*[@class='article-content']") %>%
html_text() %>%
str_trim(.))
# Create corpus
paperCorp <- Corpus(VectorSource(paperText))
for(j in seq(paperCorp))
{
paperCorp[[j]] <- gsub(":", " ", paperCorp[[j]])
paperCorp[[j]] <- gsub("\n", " ", paperCorp[[j]])
paperCorp[[j]] <- gsub("-", " ", paperCorp[[j]])
}
paperCorp <- tm_map(paperCorp, removePunctuation)
paperCorp <- tm_map(paperCorp, removeNumbers)
paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))
paperCorp <- tm_map(paperCorp, stemDocument)
paperCorp <- tm_map(paperCorp, stripWhitespace)
paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)
dtm <- DocumentTermMatrix(paperCorpPTD)
termFreq <- colSums(as.matrix(dtm))
head(termFreq)
tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]
head(tf)
# After having identified words I am not interested in
# create new corpus with these words removed.
paperCorp1 <- tm_map(paperCorp, removeWords, c("also", "article", "Article",
"download", "google", "figure",
"fig", "groups","Google", "however",
"high", "human", "levels",
"larger", "may", "number",
"shown", "study", "studies", "this",
"using", "two", "the", "Scholar",
"pubmedncbi", "PubMedNCBI",
"view", "View", "the", "biol",
"via", "image", "doi", "one",
"analysis"))
paperCorp1 <- tm_map(paperCorp1, stripWhitespace)
paperCorpPTD1 <- tm_map(paperCorp1, PlainTextDocument)
dtm1 <- DocumentTermMatrix(paperCorpPTD1)
termFreq1 <- colSums(as.matrix(dtm1))
tf1 <- data.frame(term = names(termFreq1), freq = termFreq1)
tf1 <- tf1[order(-tf1[,2]),]
head(tf1, 100)
If you look through tf1 you will notice that plenty of the words that were specified to be removed have not actually been removed.
Just wondering what I am doing wrong, and how I might remove these words from my data?
NOTE: removeWords is doing something, because the output from head(tf, 100) and head(tf1, 100) is not exactly the same. So removeWords seems to be removing some instances of the words I am trying to get rid of, but not all instances.
If someone gets an error like I did and the other answer here still doesn't work, try using:
paperCorp <- tm_map(paperCorp, content_transformer(tolower))
instead of paperCorp <- tm_map(paperCorp, tolower), because tolower() is a function from the base package and returns a different structure (it changes the result type), so you can't use, for example, paperCorp[[j]]$content but only paperCorp[[j]]. It's just a digression, but maybe helpful to someone.
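To make the structure point concrete, here is a minimal base R sketch that does not use tm at all. It assumes a document can be pictured as a list with a content slot plus metadata; make_doc and keep_structure are hypothetical names invented for this illustration, and keep_structure only imitates the general idea of wrapping a function so that it rewrites the content and leaves the rest alone.

```r
# A toy stand-in for a text document: a list with a content slot and metadata.
# make_doc and keep_structure are hypothetical helpers, not part of tm.
make_doc <- function(txt) list(content = txt, meta = list(language = "en"))

# Wrap a plain character function so it rewrites only the content slot.
keep_structure <- function(fun) {
  function(doc, ...) {
    doc$content <- fun(doc$content, ...)
    doc
  }
}

doc <- make_doc("Google Scholar View Article")

# Applying tolower directly yields a bare character vector: the wrapper is gone.
bare <- tolower(doc$content)

# The wrapped version returns the same kind of object it was given.
wrapped <- keep_structure(tolower)(doc)

str(bare)
str(wrapped)
```

With the wrapped version, wrapped$content is still available afterwards, which is the behaviour the answer above is relying on.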
I switched some code around and added tolower. The stopwords are all in lowercase, so you need to convert the text to lowercase before you remove stopwords.
paperCorp <- tm_map(paperCorp, removePunctuation)
paperCorp <- tm_map(paperCorp, removeNumbers)
# added tolower
paperCorp <- tm_map(paperCorp, tolower)
paperCorp <- tm_map(paperCorp, removeWords, stopwords("english"))
# moved stripWhitespace
paperCorp <- tm_map(paperCorp, stripWhitespace)
paperCorp <- tm_map(paperCorp, stemDocument)
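To see why the lower-casing step has to come before word removal, here is a small base R sketch. It assumes that word removal boils down to a case-sensitive, whole-word regular expression match; remove_words_sketch is a hypothetical helper written for this illustration and is only a rough stand-in for removeWords, not its actual implementation.

```r
# Rough stand-in for word removal: delete whole-word, case-sensitive matches.
# remove_words_sketch is a hypothetical helper, not part of tm.
remove_words_sketch <- function(x, words) {
  pattern <- sprintf("\\b(%s)\\b", paste(words, collapse = "|"))
  gsub(pattern, "", x, perl = TRUE)
}

text <- "This Study extends the study of dietary fatty acids"
unwanted <- c("this", "study", "the")

# Without lower-casing, "This" and "Study" survive the removal.
as_is <- remove_words_sketch(text, unwanted)

# Lower-casing first lets every variant of the unwanted words match.
lowered <- remove_words_sketch(tolower(text), unwanted)

print(as_is)
print(lowered)
```

The removed words leave extra spaces behind, which is why a whitespace clean-up step such as stripWhitespace is run after the removal.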
The upper-case words are no longer needed, since we set everything to lower case. You can remove these from the list.
paperCorp <- tm_map(paperCorp, removeWords, c("also", "article", "Article",
"download", "google", "figure",
"fig", "groups","Google", "however",
"high", "human", "levels",
"larger", "may", "number",
"shown", "study", "studies", "this",
"using", "two", "the", "Scholar",
"pubmedncbi", "PubMedNCBI",
"view", "View", "the", "biol",
"via", "image", "doi", "one",
"analysis"))
paperCorpPTD <- tm_map(paperCorp, PlainTextDocument)
dtm <- DocumentTermMatrix(paperCorpPTD)
termFreq <- colSums(as.matrix(dtm))
head(termFreq)
tf <- data.frame(term = names(termFreq), freq = termFreq)
tf <- tf[order(-tf[,2]),]
head(tf)
           term  freq
fatty     fatty 29568
pparα     ppara 23232
acids     acids 22848
gene       gene 15360
dietary dietary 12864
scholar scholar 11904
tf[tf$term == "study"]
data frame with 0 columns and 1659 rows
And as you can see, the outcome is that study is no longer in the corpus. The rest of the words are also gone.
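One caveat about the check itself: tf[tf$term == "study"] has no comma, so the logical vector selects columns rather than rows, which is why the output reports 0 columns. A row-wise check might look like the sketch below. tf_demo is a made-up toy table standing in for tf, and its frequencies are invented for illustration.

```r
# Toy frequency table standing in for tf; the values are made up.
tf_demo <- data.frame(term = c("fatty", "acids", "gene"),
                      freq = c(3, 2, 1),
                      stringsAsFactors = FALSE)

# The comma matters: a logical vector before the comma selects rows.
study_rows <- tf_demo[tf_demo$term == "study", ]
print(nrow(study_rows))

# A membership test answers the same question without subsetting.
print("study" %in% tf_demo$term)
```

Either form run against the real tf would show directly whether a given term is still present.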