I have ~1,4m documents with average of characters per document of(Median:250 and Mean:470).
I want to perform spell check and stemming, before classifying them.
Simulated document:
sentence <- "We aree drivng as fast as we drove yestrday or evven fastter zysxzw" %>%
rep(times = 6) %>%
paste(collapse = " ")
nchar(sentence)
[1] 407
function to perform first spellcheck and then stemming
library(hunspell)
library(magrittr)
spellAndStem <- function(sent, language = "en_US"){
words <- sentence %>%
strsplit(split = " ") %>%
unlist
# spelling
correct <- hunspell_check(
words = words,
dict = dictionary(language)
)
words[!correct] %<>%
hunspell_suggest(dict = language) %>%
sapply(FUN = "[", 1)
# stemming
words %>%
hunspell_stem(dict = dictionary(language)) %>%
unlist %>%
paste(collapse = " ")
}
I looked in to hunspell()
function to give over the document as a whole for Performance gains, but i dont see how i could do spell check and stemming in that sequence.
Time measurement:
> library(microbenchmark)
> microbenchmark(spellAndStem(sentence), times = 100)
Unit: milliseconds
expr min lq mean median uq max neval
spellAndStem(sentence) 680.3601 689.8842 700.7957 694.3781 702.7493 798.9544 100
With 0.7s per document, it would require 0.7*1400000/3600/24 = 11.3 days to do the calculation.
Question:
How can i optimize this Performance?
Final remark:
Target language is 98% German and 2% english. Not sure if the info matters, just for completeness.
You can substantially optimize your code by performing expensive steps on the vocabulary instead of all words in the document. The quanteda
package offers a really useful object class or this called tokens
:
toks <- quanteda::tokens(sentence)
unclass(toks)
#> $text1
#> [1] 1 2 3 4 5 4 6 7 8 9 10 11 12 1 2 3 4 5 4 6 7 8 9 10 11
#> [26] 12 1 2 3 4 5 4 6 7 8 9 10 11 12 1 2 3 4 5 4 6 7 8 9 10
#> [51] 11 12 1 2 3 4 5 4 6 7 8 9 10 11 12 1 2 3 4 5 4 6 7 8 9
#> [76] 10 11 12
#>
#> attr(,"types")
#> [1] "We" "aree" "drivng" "as" "fast" "we"
#> [7] "drove" "yestrday" "or" "evven" "fastter" "zysxzw"
#> attr(,"padding")
#> [1] FALSE
#> attr(,"what")
#> [1] "word"
#> attr(,"ngrams")
#> [1] 1
#> attr(,"skip")
#> [1] 0
#> attr(,"concatenator")
#> [1] "_"
#> attr(,"docvars")
#> data frame with 0 columns and 1 row
As you can see, text is split into vocabulary (types
) and position of the words. We can use this to optimize your code by performing all steps on the types
instead of the entire text:
spellAndStem_tokens <- function(sent, language = "en_US") {
sent_t <- quanteda::tokens(sent)
# extract types to only work on them
types <- quanteda::types(sent_t)
# spelling
correct <- hunspell_check(
words = as.character(types),
dict = hunspell::dictionary(language)
)
pattern <- types[!correct]
replacement <- sapply(hunspell_suggest(pattern, dict = language), FUN = "[", 1)
types <- stringi::stri_replace_all_fixed(
types,
pattern,
replacement,
vectorize_all = FALSE
)
# stemming
types <- hunspell_stem(types, dict = dictionary(language))
# replace original tokens
sent_t_new <- quanteda::tokens_replace(sent_t, quanteda::types(sent_t), as.character(types))
sent_t_new <- quanteda::tokens_remove(sent_t_new, pattern = "NULL", valuetype = "fixed")
paste(as.character(sent_t_new), collapse = " ")
}
I'm using the bench
package to do the benchmarking as it also checks if the results of the two functions are identical and as I find it more comfortable in general:
res <- bench::mark(
spellAndStem(sentence),
spellAndStem_tokens(sentence)
)
res
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 spellAndStem(sentence) 807ms 807ms 1.24 259KB 0
#> 2 spellAndStem_tokens(sentence) 148ms 150ms 6.61 289KB 0
summary(res, relative = TRUE)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 spellAndStem(sentence) 5.44 5.37 1 1 NaN
#> 2 spellAndStem_tokens(sentence) 1 1 5.33 1.11 NaN
The new function is 5.44 times faster than the original one. Note though that the difference is getting even more pronounced the larger the input text is:
sentence <- "We aree drivng as fast as we drove yestrday or evven fastter zysxzw" %>%
rep(times = 600) %>%
paste(collapse = " ")
res_big <- bench::mark(
spellAndStem(sentence),
spellAndStem_tokens(sentence)
)
res_big
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 spellAndStem(sentence) 1.27m 1.27m 0.0131 749.81KB 0
#> 2 spellAndStem_tokens(sentence) 178.26ms 182.12ms 5.51 1.94MB 0
summary(res_big, relative = TRUE)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 spellAndStem(sentence) 428. 419. 1 1 NaN
#> 2 spellAndStem_tokens(sentence) 1 1 420. 2.65 NaN
As you can see, the time it takes to process the 100 x bigger sample is almost the same as for the smaller one. This is because the vocabulary between the two is exactly the same. We can extrapolate from this result to your entire dataset assuming this bigger sample represents 100 of your documents. The function should take less than an hour (0.17826 * 14000 / 3600 = 0.69) but the calculation is really imperfect as the actual time it takes to run it on your real data will depend almost solely on the size of the vocabulary.
Besides the programming/performance aspect, I have a few more concerns that might not be applicable in your specific case:
sapply(as.list(sent_t_new), paste, collapse = " ")
as this will not collapse all documents into one long string but keep them separate.hunspell
could not find any suggestions for. I copied this approach (see the tokens_remove
command) but you might want to think about at least outputting the discarded words instead of removing them silently.spacyr
) or simply turning it off since stemming rarely improves results in German.hunspell_suggest
is just an expensive operation since it calculates the distance between your string and every word in the dictionary (see here: https://github.com/ropensci/hunspell/issues/7). When I remove the hunspell_suggest
lines, it only takes 25ms on average on my machine. So if you want to speed it up, this is the critical part. Note that it makes a difference how many incorrect words are in the actual documents. Your example with about 50% misspelled words should rather be the exception. Why don't you try the algorithm on the first couple of documents first to have a more realistic time estimate. I assume the language will matter (for your benefit) as there are more words in English than German (think dictionary size).
A simple and obvious thing to do would be to use multiple cores. Something simple as the following with the parallel
package already halves the time with my four cores:
sentences <- rep(sentence, 4)
microbenchmark(lapply = lapply(sentences, spellAndStem),
mclapply = parallel::mclapply(sentences, spellAndStem),
times = 10)
Unit: seconds
expr min lq mean median uq max neval cld
lapply(sentences, spellAndStem) 1.967008 2.023291 2.045705 2.051764 2.077168 2.105420 10 b
parallel::mclapply(sentences, spellAndStem) 1.011945 1.048055 1.078003 1.081850 1.109274 1.135508 10 a
The suggestion by Andrew Gustar could also work. Even if you just apply the suggest function to a group of documents, this should speed up the calculation significantly. The problem is to separate the documents and put them together after stemming---I guess a "separator" for the documents would just be stemmed and be unrecognizable afterwards. Judging from your question you have already tried this or something similar.
A smaller dictionary could also help, but is probably not a good idea if you want high-quality data.
By the way, I would not consider 11 days to be long for a computation that has to be done only once. You could simply upload the script to a Server that has R installed and let it run there via Rscript
from the shell (use nohup
to log out again without stopping the process). This is especially true if you have access to a strong "working machine" (e.g. at a university) with many cores.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With