
Quanteda: how to remove my own list of words

Since there is no ready implementation of stopwords for Polish in quanteda, I would like to use my own list. I have it in a text file as a list separated by spaces. If need be, I can also prepare a list separated by new lines.

How can I remove the custom long list of stopwords from my corpus? How can I do that after stemming?

I have tried creating various formats, converting to string vectors like

stopwordsPL <- as.character(readtext("polish.stopwords.txt",encoding = "UTF-8"))
stopwordsPL <- read.txt("polish.stopwords.txt",encoding = "UTF-8",stringsAsFactors = F))
stopwordsPL <- dictionary(stopwordsPL)

I have also tried passing such word vectors in calls like

myStemMat <-
  dfm(
    mycorpus,
    remove = as.vector(stopwordsPL),
    stem = FALSE,
    remove_punct = TRUE,
    ngrams=c(1,3)
  )

dfm_trim(myStemMat, sparsity = stopwordsPL)

or

myStemMat <- dfm_remove(myStemMat,features = as.data.frame(stopwordsPL))

Nothing works. My stopwords show up in the corpus and in the analysis. What should be the proper way/syntax to apply custom stop words?

Jacek Kotowski asked Jul 26 '17 12:07



1 Answer

Assuming your polish.stopwords.txt looks like this, with one stopword per line, you should be able to remove the stopwords from your corpus easily this way:

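# read the file into a character vector, one element per line
stopwordsPL <- readLines("polish.stopwords.txt", encoding = "UTF-8")

# build the dfm, dropping the stopwords as features
dfm(mycorpus,
    remove = stopwordsPL,
    stem = FALSE,
    remove_punct = TRUE,
    ngrams = c(1, 3))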

The solution using readtext() does not work because it reads in the entire file as one document. To get the individual words, you would need to tokenise that document and coerce the tokens to character. readLines() is probably easier.
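If the list is separated by spaces rather than new lines, as the question mentions, base R's scan() is one way to get the same kind of character vector, since it splits on any whitespace by default. A minimal sketch, assuming a small made-up file in place of the real polish.stopwords.txt:

```r
# Hypothetical stand-in for polish.stopwords.txt: words separated by spaces.
stopword_file <- tempfile(fileext = ".txt")
writeLines("i w na do nie to", stopword_file)

# With what = character() and the default sep = "", scan() splits on any
# whitespace (spaces, tabs, newlines), so it copes with either layout.
# quote = "" stops apostrophes or quotation marks from being treated as quoting.
stopwordsPL <- scan(stopword_file, what = character(), quote = "",
                    encoding = "UTF-8", quiet = TRUE)

# The result is a plain character vector, one element per stopword.
is.character(stopwordsPL)
length(stopwordsPL)
```

A vector built this way can be passed wherever the readLines() result is used.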

No need to create a dictionary from stopwordsPL either, since remove should take a character vector. Also, there is no Polish stemmer implemented yet, I am afraid.

Currently (v0.9.9-65), the feature removal in dfm() does not get rid of stopwords that form part of bigrams. To work around this, try:

# form the tokens, removing punctuation
mytoks <- tokens(mycorpus, remove_punct = TRUE)
# remove the Polish stopwords, leave pads
mytoks <- tokens_remove(mytoks, stopwordsPL, padding = TRUE)
## can't do this next one since no Polish stemmer in 
## SnowballC::getStemLanguages()
# mytoks <- tokens_wordstem(mytoks, language = "polish")
# form the ngrams
mytoks <- tokens_ngrams(mytoks, n = c(1, 3))
# construct the dfm
dfm(mytoks)
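
The reason for leaving pads is that, once a stopword is simply dropped, its neighbours look adjacent and can be joined into an ngram that never occurred in the text. The following toy sketch in base R illustrates that idea only; it is not quanteda's implementation, and adjacent_ngrams() is a made-up helper for this example:

```r
# Toy token vector; "i" and "w" are treated as stopwords here.
toks <- c("kot", "i", "pies", "w", "domu")
stops <- c("i", "w")

# Hypothetical helper: join every run of n adjacent tokens, skipping any
# run that contains a pad (""), to mimic the role the pads play.
adjacent_ngrams <- function(x, n = 2) {
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1)
  grams <- vapply(starts, function(i) {
    window <- x[i:(i + n - 1)]
    if (any(window == "")) NA_character_ else paste(window, collapse = "_")
  }, character(1))
  grams[!is.na(grams)]
}

# Without pads: dropping the stopwords makes "kot" and "pies" look adjacent.
no_pad <- toks[!toks %in% stops]
bigrams_no_pad <- adjacent_ngrams(no_pad)

# With pads: removed positions stay as "", so no bigram spans the gap.
padded <- ifelse(toks %in% stops, "", toks)
bigrams_padded <- adjacent_ngrams(padded)
```

Here bigrams_no_pad contains "kot_pies", which was never a real bigram in the text, while bigrams_padded is empty because every adjacent pair touches a removed stopword.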
Ken Benoit answered Nov 02 '22 02:11
