I am trying to filter stop-words from the following documents using package tm.
library(tm)
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus,control=list(stopwords=TRUE))
However, when I run this code I still get the following in the DocumentTermMatrix.
colnames(matrix)
[1] "brown" "dog" "fox" "jumps" "lazy" "over" "quick" "the" "walrus"
"The" is listed as a stop-word in the list that package tm uses. Am I doing something wrong regarding the stopwords parameter, or is this a bug in the tm package?
EDIT: I contacted Ingo Feinerer and he noted that it is technically not a bug:
User-provided options are processed first, and then all remaining options. Hence stopword removal is done before tokenization (as already written by Vincent Zoonekynd on stackoverflow.com) which gives exactly your result.
Therefore, the solution is to explicitly list the default tokenizing option prior to the stopwords parameter, for example:
library(tm)
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus,control=list(tokenize=scan_tokenizer,stopwords=TRUE))
colnames(matrix)
You could also try removing the stopwords from the corpus before you create the term matrix.
text_corpus <- tm_map(text_corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(text_corpus)
This usually works for me.
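For completeness, here is a self-contained sketch of that approach applied to the documents from the question. The snippet above does not show how text_corpus was built, so the use of VCorpus and the lowercasing step are my assumptions, not part of the original answer.

```r
library(tm)

documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")

# Assumption: build the corpus with VCorpus; the original snippet
# does not say how text_corpus was created.
text_corpus <- VCorpus(VectorSource(documents))

# Assumption: lowercase first, so capitalised forms such as "The"
# match the lowercase entries in the stop-word list.
text_corpus <- tm_map(text_corpus, content_transformer(tolower))

# Remove English stop-words from the text itself, before the
# document-term matrix (and its tokenization) is built.
text_corpus <- tm_map(text_corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(text_corpus)
terms <- Terms(dtm)
print(terms)
```

Because the stop-words are stripped from the corpus text up front, the result should not depend on the order in which DocumentTermMatrix processes its control options.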