I am trying to make 2 document-term matrices for a corpus, one with unigrams and one with bigrams. However, the bigram matrix is currently just identical to the unigram matrix, and I'm not sure why.
The code:
docs<-Corpus(DirSource("data", recursive=TRUE))
# Get the document term matrices
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_unigram <- DocumentTermMatrix(docs, control = list(tokenize="words",
removePunctuation = TRUE,
stopwords = stopwords("english"),
stemming = TRUE))
dtm_bigram <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer,
removePunctuation = TRUE,
stopwords = stopwords("english"),
stemming = TRUE))
inspect(dtm_unigram)
inspect(dtm_bigram)
I also tried using ngram(x, n=2) from the ngram package as the tokenizer, but that doesn't work either. How do I fix the bigram tokenization?
The tokenizer option doesn't seem to work with Corpus (SimpleCorpus). Using VCorpus instead cleared up the problem.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With