With the tm package I'm able to do it like this:
c0 <- Corpus(VectorSource(text))
c0 <- tm_map(c0, removeWords, c(stopwords("english"),mystopwords))
where mystopwords is a vector of the additional stopwords I want to remove.
But I can't find an equivalent way to do it using the RTextTools package. For example:
dtm <- create_matrix(text, language="english",
                     removePunctuation=TRUE,
                     stripWhitespace=TRUE,
                     toLower=TRUE,
                     removeStopwords=TRUE, # no clear way to specify a custom list here!
                     stemWords=TRUE)
Is it possible to do this? I really like the RTextTools interface and it would be a pity to have to move back to tm.
But here's the catch: there's no universal stop words list because a word can be empty of meaning depending on the corpus you are using or on the problem you are analysing. This means that any word can be a stop word depending on what you are trying to do.
To add a custom stopword in spaCy, first load its English language model, then use the add() method to add stopwords.
There are three (or possibly even more) solutions to your problem:
First, use the tm package only for removing words. Both packages deal with the same objects, so you can use tm just to remove the words and then continue with the RTextTools package. In fact, if you look inside the function create_matrix, you will see that it uses tm functions.
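A minimal sketch of this first option, assuming both tm and RTextTools are installed. The sample text and the contents of mystopwords are made up for illustration and are not from the question. It relies on removeWords() from tm also accepting a plain character vector, so the custom words are stripped before create_matrix ever sees the text:

```r
library(tm)          # provides removeWords()
library(RTextTools)  # provides create_matrix()

# Hypothetical example data, not from the original question
text <- c("The quick brown fox jumps over the lazy dog",
          "A lazy afternoon with the brown dog")
mystopwords <- c("fox", "dog")

# removeWords() matches case-sensitively, so lower-case the text first
cleaned <- removeWords(tolower(text), mystopwords)

# Build the matrix as usual; the built-in English stopword list is still
# handled by removeStopwords. Stemming is switched off here so the column
# names stay comparable to the unstemmed custom stopwords.
dtm <- create_matrix(cleaned, language = "english",
                     removePunctuation = TRUE,
                     stripWhitespace = TRUE,
                     toLower = TRUE,
                     removeStopwords = TRUE,
                     stemWords = FALSE)

# Check whether any custom stopword survived as a term
any(mystopwords %in% colnames(dtm))
```

If you need stemming, remove the custom words before stemming as shown, since stemmed terms may no longer match the entries in mystopwords.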
Second, modify the create_matrix function. For example, add an input parameter like own_stopwords=NULL and add the following lines:
# existing line
corpus <- Corpus(VectorSource(trainingColumn),
readerControl = list(language = language))
# after that add this new line
if(!is.null(own_stopwords)) corpus <- tm_map(corpus, removeWords,
words=as.character(own_stopwords))
Third, write your own function, something like this:
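# excluder function: drop the columns (terms) of a document-term matrix
# whose names appear in own_stw
remove_my_stopwords <- function(own_stw, dtm) {
  # select the columns to keep by positive index; a negative index built from
  # the matches would return zero columns when none of the words is present
  keep <- which(!(colnames(dtm) %in% own_stw))
  dtm[, keep]
}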
Let's check whether it works:
# let's test it
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=10,replace=FALSE),]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]))
head(colnames(matrix), 5)
# [1] "109" "200th" "abc" "amid" "anniversary"
# let's treat the first five terms above as our "own" stopwords
ostw <- head(colnames(matrix), 5)
matrix2<-remove_my_stopwords(own_stw=ostw, dtm=matrix)
# check if they are still there
sapply(ostw, function(x, words) any(x==words), words=colnames(matrix2))
#109 200th abc amid anniversary
#FALSE FALSE FALSE FALSE FALSE
HTH