Package tm stop-word parameter

Tags: r, nlp

I am trying to filter stop-words from the following documents using the tm package.

library(tm)
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus,control=list(stopwords=TRUE))

However, when I run this code, "the" still appears among the terms of the DocumentTermMatrix:

colnames(matrix)
[1] "brown"  "dog"    "fox"    "jumps"  "lazy"   "over"   "quick"  "the"    "walrus"

"The" is listed as a stop-word in the list that package tm uses. Am I doing something wrong regarding the stopwords parameter, or is this a bug in the tm package?

EDIT: I contacted Ingo Feinerer and he noted that it is technically not a bug:

User-provided options are processed first, and then all remaining options. Hence stopword removal is done before tokenization (as already written by Vincent Zoonekynd on stackoverflow.com) which gives exactly your result.
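
To illustrate why the order matters, here is a minimal base-R sketch (not tm's actual code). It assumes the stop-word filter is an exact match of each element against the list, and it uses a short hypothetical stop-word list rather than tm's own. Before tokenization the whole document is a single element, so nothing matches; after tokenization each word is compared individually.

```r
# Hypothetical stop-word list for illustration; tm's English list is much longer.
stop_words <- c("the", "i", "am")
doc <- "the quick brown fox jumps over the lazy dog"

# Filtering before tokenizing: the only element is the full string,
# which is not itself a stop word, so it survives unchanged.
untokenized <- doc[!doc %in% stop_words]

# Tokenizing first (splitting on whitespace), then filtering word by word.
tokens <- strsplit(doc, "[[:space:]]+")[[1]]
filtered <- tokens[!tokens %in% stop_words]

print(untokenized)
print(filtered)
```

In this sketch `untokenized` still contains "the", while `filtered` does not.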

Therefore, the solution is to explicitly list the default tokenizer before the stopwords option in the control list, for example:

library(tm)
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus,control=list(tokenize=scan_tokenizer,stopwords=TRUE))
colnames(matrix)
Timothy P. Jurka asked Jan 18 '23 16:01


1 Answer

You could also try removing the stopwords from the corpus before you create the term matrix.

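library(tm)
# Build the corpus first; `documents` is the character vector from the question
text_corpus <- Corpus(VectorSource(documents))
# Remove English stop-words from every document, then build the matrix
text_corpus <- tm_map(text_corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(text_corpus)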

This usually works for me.

Shreyes answered Jan 27 '23 08:01