Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do i optimize the performance of stemming and spell check in R?

I have ~1,4m documents with average of characters per document of(Median:250 and Mean:470).

I want to perform spell check and stemming, before classifying them.

Simulated document:

sentence <- "We aree drivng as fast as we drove yestrday or evven fastter zysxzw" %>%
    rep(times = 6) %>%
    paste(collapse = " ")

nchar(sentence)
[1] 407 

function to perform first spellcheck and then stemming

library(hunspell)
library(magrittr)

spellAndStem <- function(sent, language = "en_US"){
  words <- sentence %>%
    strsplit(split = " ") %>%
    unlist

  # spelling
  correct <- hunspell_check(
        words = words, 
        dict = dictionary(language)
  )

  words[!correct] %<>%
    hunspell_suggest(dict = language) %>%
    sapply(FUN = "[", 1)

  # stemming
  words %>%
    hunspell_stem(dict = dictionary(language)) %>%
    unlist %>%
    paste(collapse = " ")
}

I looked in to hunspell() function to give over the document as a whole for Performance gains, but i dont see how i could do spell check and stemming in that sequence.

Time measurement:

> library(microbenchmark)
> microbenchmark(spellAndStem(sentence), times = 100)
Unit: milliseconds
                   expr      min       lq     mean   median       uq      max neval
 spellAndStem(sentence) 680.3601 689.8842 700.7957 694.3781 702.7493 798.9544   100

With 0.7s per document, it would require 0.7*1400000/3600/24 = 11.3 days to do the calculation.

Question:

How can i optimize this Performance?

Final remark:

Target language is 98% German and 2% english. Not sure if the info matters, just for completeness.

like image 763
Tlatwork Avatar asked Feb 20 '20 12:02

Tlatwork


2 Answers

You can substantially optimize your code by performing expensive steps on the vocabulary instead of all words in the document. The quanteda package offers a really useful object class or this called tokens:

toks <- quanteda::tokens(sentence)
unclass(toks)
#> $text1
#>  [1]  1  2  3  4  5  4  6  7  8  9 10 11 12  1  2  3  4  5  4  6  7  8  9 10 11
#> [26] 12  1  2  3  4  5  4  6  7  8  9 10 11 12  1  2  3  4  5  4  6  7  8  9 10
#> [51] 11 12  1  2  3  4  5  4  6  7  8  9 10 11 12  1  2  3  4  5  4  6  7  8  9
#> [76] 10 11 12
#> 
#> attr(,"types")
#>  [1] "We"       "aree"     "drivng"   "as"       "fast"     "we"      
#>  [7] "drove"    "yestrday" "or"       "evven"    "fastter"  "zysxzw"  
#> attr(,"padding")
#> [1] FALSE
#> attr(,"what")
#> [1] "word"
#> attr(,"ngrams")
#> [1] 1
#> attr(,"skip")
#> [1] 0
#> attr(,"concatenator")
#> [1] "_"
#> attr(,"docvars")
#> data frame with 0 columns and 1 row

As you can see, text is split into vocabulary (types) and position of the words. We can use this to optimize your code by performing all steps on the types instead of the entire text:

spellAndStem_tokens <- function(sent, language = "en_US") {

  sent_t <- quanteda::tokens(sent)

  # extract types to only work on them
  types <- quanteda::types(sent_t)

  # spelling
  correct <- hunspell_check(
    words = as.character(types), 
    dict = hunspell::dictionary(language)
  )

  pattern <- types[!correct]
  replacement <- sapply(hunspell_suggest(pattern, dict = language), FUN = "[", 1)

  types <- stringi::stri_replace_all_fixed(
    types,
    pattern, 
    replacement,
    vectorize_all = FALSE
  )

  # stemming
  types <- hunspell_stem(types, dict = dictionary(language))


  # replace original tokens
  sent_t_new <- quanteda::tokens_replace(sent_t, quanteda::types(sent_t), as.character(types))

  sent_t_new <- quanteda::tokens_remove(sent_t_new, pattern = "NULL", valuetype = "fixed")

  paste(as.character(sent_t_new), collapse = " ")
}

I'm using the bench package to do the benchmarking as it also checks if the results of the two functions are identical and as I find it more comfortable in general:

res <- bench::mark(
  spellAndStem(sentence),
  spellAndStem_tokens(sentence)
)

res
#> # A tibble: 2 x 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 spellAndStem(sentence)           807ms    807ms      1.24     259KB        0
#> 2 spellAndStem_tokens(sentence)    148ms    150ms      6.61     289KB        0

summary(res, relative = TRUE)
#> # A tibble: 2 x 6
#>   expression                      min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 spellAndStem(sentence)         5.44   5.37      1         1         NaN
#> 2 spellAndStem_tokens(sentence)  1      1         5.33      1.11      NaN

The new function is 5.44 times faster than the original one. Note though that the difference is getting even more pronounced the larger the input text is:

sentence <- "We aree drivng as fast as we drove yestrday or evven fastter zysxzw" %>%
  rep(times = 600) %>%
  paste(collapse = " ")

res_big <- bench::mark(
  spellAndStem(sentence),
  spellAndStem_tokens(sentence)
)

res_big
#> # A tibble: 2 x 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 spellAndStem(sentence)         1.27m    1.27m      0.0131  749.81KB        0
#> 2 spellAndStem_tokens(sentence)  178.26ms 182.12ms   5.51      1.94MB        0
summary(res_big, relative = TRUE)
#> # A tibble: 2 x 6
#>   expression                      min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
#> 1 spellAndStem(sentence)         428.   419.        1       1         NaN
#> 2 spellAndStem_tokens(sentence)   1      1       420.       2.65      NaN

As you can see, the time it takes to process the 100 x bigger sample is almost the same as for the smaller one. This is because the vocabulary between the two is exactly the same. We can extrapolate from this result to your entire dataset assuming this bigger sample represents 100 of your documents. The function should take less than an hour (0.17826 * 14000 / 3600 = 0.69) but the calculation is really imperfect as the actual time it takes to run it on your real data will depend almost solely on the size of the vocabulary.

Besides the programming/performance aspect, I have a few more concerns that might not be applicable in your specific case:

  1. I would suggest changing the last line in the function to sapply(as.list(sent_t_new), paste, collapse = " ") as this will not collapse all documents into one long string but keep them separate.
  2. Currently, your setup removes words where hunspell could not find any suggestions for. I copied this approach (see the tokens_remove command) but you might want to think about at least outputting the discarded words instead of removing them silently.
  3. If the function above is for preparation for some other text analysis, it would make more sense to transform the data directly into a document-term-matrix before stemming and spell checking are performed.
  4. Stemming is just an approximation to lemmatization, which is the process of actually finding the base form of a word. Additionally, stemming usually works quite poorly in German. Depending on what you are doing, you might want to do lemmatization instead (e.g., using spacyr) or simply turning it off since stemming rarely improves results in German.
like image 174
JBGruber Avatar answered Sep 24 '22 07:09

JBGruber


hunspell_suggest is just an expensive operation since it calculates the distance between your string and every word in the dictionary (see here: https://github.com/ropensci/hunspell/issues/7). When I remove the hunspell_suggest lines, it only takes 25ms on average on my machine. So if you want to speed it up, this is the critical part. Note that it makes a difference how many incorrect words are in the actual documents. Your example with about 50% misspelled words should rather be the exception. Why don't you try the algorithm on the first couple of documents first to have a more realistic time estimate. I assume the language will matter (for your benefit) as there are more words in English than German (think dictionary size).

A simple and obvious thing to do would be to use multiple cores. Something simple as the following with the parallel package already halves the time with my four cores:

sentences <- rep(sentence, 4)
microbenchmark(lapply = lapply(sentences, spellAndStem),
               mclapply = parallel::mclapply(sentences, spellAndStem),
               times = 10)

Unit: seconds
                                        expr      min       lq     mean   median       uq      max neval cld
             lapply(sentences, spellAndStem) 1.967008 2.023291 2.045705 2.051764 2.077168 2.105420    10   b
 parallel::mclapply(sentences, spellAndStem) 1.011945 1.048055 1.078003 1.081850 1.109274 1.135508    10  a 

The suggestion by Andrew Gustar could also work. Even if you just apply the suggest function to a group of documents, this should speed up the calculation significantly. The problem is to separate the documents and put them together after stemming---I guess a "separator" for the documents would just be stemmed and be unrecognizable afterwards. Judging from your question you have already tried this or something similar.

A smaller dictionary could also help, but is probably not a good idea if you want high-quality data.

By the way, I would not consider 11 days to be long for a computation that has to be done only once. You could simply upload the script to a Server that has R installed and let it run there via Rscript from the shell (use nohup to log out again without stopping the process). This is especially true if you have access to a strong "working machine" (e.g. at a university) with many cores.

like image 25
Johannes Titz Avatar answered Sep 23 '22 07:09

Johannes Titz