Package tm stop-word parameter

Question

I am trying to filter stop-words from the following documents using package tm.

library(tm)
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus, control = list(stopwords = TRUE))

However, when I run this code, "the" still appears among the column names of the DocumentTermMatrix:

colnames(matrix)
[1] "brown"  "dog"    "fox"    "jumps"  "lazy"   "over"   "quick"  "the"    "walrus"

"the" is in the stop-word list that package tm uses, so I expected it to be removed. Am I using the stopwords parameter incorrectly, or is this a bug in the tm package?

EDIT: I contacted Ingo Feinerer and he noted that it is technically not a bug:

User-provided options are processed first, and then all remaining options. Hence stopword removal is done before tokenization (as already written by Vincent Zoonekynd on stackoverflow.com) which gives exactly your result.

Therefore, the solution is to list the default tokenizer explicitly, before the stopwords parameter, so that tokenization runs first. For example:

library(tm)
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus, control = list(tokenize = scan_tokenizer, stopwords = TRUE))
colnames(matrix)
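
To see why the order matters, here is a minimal sketch in base R. It is only an illustration of the idea in the quote above, not tm's internal code, and stop_list is a small hand-picked (hypothetical) list rather than the one tm ships.

```r
# Simplified illustration in base R; this is NOT tm's internal code.
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
stop_list <- c("the", "i", "am", "over")  # hypothetical, hand-picked stop list

# Filtering before tokenizing: each element is still a whole document,
# so no element equals a stop word and nothing is dropped.
untokenized <- documents[!(documents %in% stop_list)]

# Tokenizing first: split on whitespace, then filter the individual words.
tokens <- unlist(strsplit(documents, "[[:space:]]+"))
filtered <- tokens[!(tokens %in% stop_list)]

print(untokenized)
print(filtered)
```

In the first case both documents come back unchanged, which matches the behavior I saw; in the second case the stop words are gone.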

Shreyes · Accepted Answer

You could also try removing the stopwords from the corpus before you create the term matrix.

# text_corpus is a tm corpus, e.g. Corpus(VectorSource(documents))
text_corpus <- tm_map(text_corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(text_corpus)

This usually works for me.
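
For intuition, removing words from the raw text before building the matrix amounts to something like the base R sketch below. This is an assumption-laden illustration, not tm's removeWords() implementation, and stop_list is a hand-picked (hypothetical) list standing in for stopwords("english").

```r
# Base R sketch of word removal on raw text; NOT tm's removeWords() itself.
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
stop_list <- c("the", "i", "am", "over")  # hypothetical, hand-picked stop list

# Build one pattern that matches any stop word as a whole word.
pattern <- paste0("\\b(", paste(stop_list, collapse = "|"), ")\\b")

# Delete the matches, then collapse the leftover whitespace.
cleaned <- gsub(pattern, "", documents, perl = TRUE)
cleaned <- trimws(gsub("[[:space:]]+", " ", cleaned))

print(cleaned)
```

Because the words are stripped from the text itself, the result does not depend on the order in which the DocumentTermMatrix control options are processed.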

Tags: r, nlp

Timothy P. Jurka
