Stemming function for text2vec

Tags: r, text2vec

I am using text2vec in R and having difficulty writing a stemming function that works with the itoken function in the text2vec package. The text2vec documentation suggests this stemming function:

stem_tokenizer1 = function(x) {
  word_tokenizer(x) %>% lapply(SnowballC::wordStem(language = 'en'))
}

However, this function does not work. Here is the code I ran (borrowed from previous Stack Overflow answers):

library(text2vec)
library(data.table)
library(SnowballC)
data("movie_review")
train_rows = 1:1000
prepr = tolower
stem_tokenizer1 = function(x) {
  word_tokenizer(x) %>% lapply(SnowballC::wordStem(language = 'en'))
}
tok = stem_tokenizer1
it <- itoken(movie_review$review[train_rows], prepr, tok, ids = movie_review$id[train_rows])

This is the error it produces:

Error in { : argument "words" is missing, with no default

I believe the issue is that wordStem needs a character vector, but word_tokenizer produces a list of character vectors.

mr<-movie_review$review[1]
stem_mr1<-stem_tokenizer1(mr)

Error in SnowballC::wordStem(language = "en") : argument "words" is missing, with no default
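
For reference, here is a small sketch on toy inputs (not the movie data) showing the two output shapes:

str(word_tokenizer(c("first doc", "second doc here")))
List of 2
 $ : chr [1:2] "first" "doc"
 $ : chr [1:3] "second" "doc" "here"
str(SnowballC::wordStem(c("running", "jumps"), language = "en"))
 chr [1:2] "run" "jump"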

To fix this issue, I wrote this stemming function:

stem_tokenizer2 = function(x)  {
  list(unlist(word_tokenizer(x)) %>% SnowballC::wordStem(language='en') )
}

However, this function does not work with the create_vocabulary function.

data("movie_review")
train_rows = 1:1000
prepr = tolower
stem_tokenizer2 = function(x)  {
  list(unlist(word_tokenizer(x)) %>% SnowballC::wordStem(language='en') )
}
tok = stem_tokenizer2
it <- itoken(movie_review$review[train_rows], prepr, tok, ids = movie_review$id[train_rows])
v <- create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5)

There is no error, but when you look at the document count, the number of documents is different from the 1000 in the data, so you cannot create a document-term matrix or run an LDA.

v$document_count

[1] 10

This code (vectorizer was created from the vocabulary; I omitted that line above):

vectorizer <- vocab_vectorizer(v)
dtm_train <- create_dtm(it, vectorizer)
dtm_train

Produces this error:

10 x 3809 sparse Matrix of class "dgCMatrix"
Error in validObject(x) : invalid class “dgCMatrix” object: length(Dimnames[1]) differs from Dim[1] which is 10

My questions are: is there something wrong with the function I wrote, and why does it produce this error with create_vocabulary? I suspect the problem is the format of my function's output, but it looks identical to the output format of word_tokenizer, which works fine with itoken and create_vocabulary:

mr<-movie_review$review[1]
word_mr<-word_tokenizer(mr)
stem_mr<-stem_tokenizer2(mr)
str(word_mr)
str(stem_mr)
asked Nov 21 '16 by rreedd
1 Answer

Thanks for using text2vec and reporting the problem. There is a mistake in the docs (can you point me to where I put this example, so I can fix it?). The stem tokenizer should look like this:

stem_tokenizer1 = function(x) {
  word_tokenizer(x) %>% lapply(function(x) SnowballC::wordStem(x, language = "en"))
}

The logic is the following:

  1. It takes a character vector and tokenizes it. The output is a list of character vectors (each element of the list is a character vector, i.e. one document).
  2. Then we apply stemming to each element of the list (wordStem can be applied to a character vector).

So the example you followed contained my syntax mistake in the lapply call. Maybe it will be clearer if we rewrite it in plain R, without the %>% operator:

stem_tokenizer1 = function(x) {
  tokens = word_tokenizer(x)
  lapply(tokens, SnowballC::wordStem, language="en")
}
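
With this corrected tokenizer, the rest of your pipeline should work as expected. As a quick check, a sketch using the same movie_review setup as in your question:

library(text2vec)
library(SnowballC)
data("movie_review")
stem_tokenizer1 = function(x) {
  tokens = word_tokenizer(x)
  lapply(tokens, SnowballC::wordStem, language = "en")
}
it <- itoken(movie_review$review[1:1000], tolower, stem_tokenizer1,
             ids = movie_review$id[1:1000])
v <- create_vocabulary(it) %>% prune_vocabulary(term_count_min = 5)
v$document_count
[1] 1000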

I will also explain why you received 10 documents instead of 1000. By default, text2vec::itoken splits the data into 10 chunks (this can be adjusted in the itoken function) and processes them chunk by chunk. So when you apply unlist to each chunk, you are actually recursively unlisting 100 documents and creating a single character vector, which is then treated as one document.
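
You can see the collapsing effect of list(unlist(...)) on a toy "chunk" of three documents:

chunk <- c("good movie", "bad movie", "it was fine")
length(word_tokenizer(chunk))
[1] 3
length(list(unlist(word_tokenizer(chunk))))
[1] 1

Each chunk therefore becomes a single document, which is why you ended up with 10 documents for 10 chunks.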

answered by Dmitriy Selivanov