I am using text2vec in R and having difficulty writing a stemming function that works with the itoken function in the text2vec package. The text2vec documentation suggests this stemming function:
stem_tokenizer1 =function(x) {
word_tokenizer(x) %>% lapply(SnowballC::wordStem(language='en'))
}
However, this function does not work. This is the code I ran (borrowed from previous Stack Overflow answers):
library(text2vec)
library(data.table)
library(SnowballC)
data("movie_review")
train_rows = 1:1000
prepr = tolower
stem_tokenizer1 =function(x) {
word_tokenizer(x) %>% lapply(SnowballC::wordStem(language='en'))
}
tok = stem_tokenizer1
it <- itoken(movie_review$review[train_rows], prepr, tok, ids = movie_review$id[train_rows])
This is the error it produces:
Error in { : argument "words" is missing, with no default
I believe the issue is that wordStem needs a character vector, but word_tokenizer produces a list of character vectors.
mr<-movie_review$review[1]
stem_mr1<-stem_tokenizer1(mr)
Error in SnowballC::wordStem(language = "en") : argument "words" is missing, with no default
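For comparison, wordStem behaves as expected when it is handed a plain character vector directly (expected output shown as a comment):
SnowballC::wordStem(c("cats", "running"), language = "en")
# [1] "cat" "run"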
To fix this issue, I wrote this stemming function:
stem_tokenizer2 = function(x) {
list(unlist(word_tokenizer(x)) %>% SnowballC::wordStem(language='en') )
}
However, this function does not work with the create_vocabulary function.
data("movie_review")
train_rows = 1:1000
prepr = tolower
stem_tokenizer2 = function(x) {
list(unlist(word_tokenizer(x)) %>% SnowballC::wordStem(language='en') )
}
tok = stem_tokenizer2
it <- itoken(movie_review$review[train_rows], prepr, tok, ids = movie_review$id[train_rows])
v <- create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5)
No error, but when you look at the document count, the number of documents differs from the 1000 in the data, so you cannot create a document-term matrix or run an LDA.
v$document_count
[1] 10
This code:
vectorizer <- vocab_vectorizer(v)
dtm_train <- create_dtm(it, vectorizer)
dtm_train
Produces this error:
10 x 3809 sparse Matrix of class "dgCMatrix"
Error in validObject(x) : invalid class “dgCMatrix” object: length(Dimnames[1]) differs from Dim[1] which is 10
My questions are: is there something wrong with the function I wrote, and why does it produce this error with create_vocabulary? I suspect it is a problem with the format of my function's output, but it looks identical to the output format of word_tokenizer, which works fine with itoken and create_vocabulary:
mr<-movie_review$review[1]
word_mr<-word_tokenizer(mr)
stem_mr<-stem_tokenizer2(mr)
str(word_mr)
str(stem_mr)
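On a single short document the two outputs do indeed look identical. Here is a toy example (not the actual movie review) with the expected str() output shown as comments:
str(word_tokenizer("cats running"))
# List of 1
#  $ : chr [1:2] "cats" "running"
str(stem_tokenizer2("cats running"))
# List of 1
#  $ : chr [1:2] "cat" "run"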
Thanks for using text2vec and reporting the problem.
There is a mistake in the docs (can you point me to where you found this example, so I can fix it?).
The stem tokenizer should look like this:
stem_tokenizer1 = function(x) {
  word_tokenizer(x) %>% lapply(function(x) SnowballC::wordStem(x, language = "en"))
}
The logic is the following: wordStem can be applied to a character vector, so we lapply it over the list of character vectors that word_tokenizer returns. The example you followed contained my syntax mistake inside the lapply call. Maybe it will be clearer if we rewrite it in plain R without the %>% operator:
stem_tokenizer1 = function(x) {
  tokens = word_tokenizer(x)
  lapply(tokens, SnowballC::wordStem, language = "en")
}
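A quick sanity check on a toy input (expected output shown as comments) confirms that each document now keeps its own character vector of stems:
stem_tokenizer1(c("cats running", "dogs barked"))
# [[1]]
# [1] "cat" "run"
# [[2]]
# [1] "dog" "bark"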
I will also explain why you are receiving 10 documents instead of 1000. By default, text2vec::itoken splits the data into 10 chunks (this can be adjusted in the itoken function) and processes it chunk by chunk. So when you apply unlist to each chunk, you are actually recursively unlisting 100 documents and creating one character vector, so each chunk is counted as a single document.
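To make this concrete, here is what stem_tokenizer2 does when itoken hands it a chunk of two documents at once (toy example, expected output shown as comments):
chunk <- c("cats running", "dogs barked")  # itoken passes documents to the tokenizer in chunks like this
stem_tokenizer2(chunk)
# [[1]]
# [1] "cat"  "run"  "dog"  "bark"
# unlist() collapsed two documents into one character vector, so the whole chunk becomes a single "document"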
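Putting it all together, the corrected pipeline should look roughly like this (a sketch based on the code above; depending on your text2vec version you may need to recreate the iterator before calling create_dtm):
library(text2vec)
library(SnowballC)
data("movie_review")
train_rows = 1:1000
prepr = tolower
stem_tokenizer1 = function(x) {
  tokens = word_tokenizer(x)
  lapply(tokens, SnowballC::wordStem, language = "en")
}
it <- itoken(movie_review$review[train_rows], prepr, stem_tokenizer1,
             ids = movie_review$id[train_rows])
v <- create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5)
v$document_count  # should now be 1000
vectorizer <- vocab_vectorizer(v)
dtm_train <- create_dtm(it, vectorizer)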