 

Faster approach than gsub in R

Tags: regex, r

I'm trying to find out if there is a faster approach than the vectorized gsub function in R. I have the following data frame with some "sentences" (sent$words), and a set of words to remove from these sentences (stored in the wordsForRemoving variable).

sent <- data.frame(words = 
                     c("just right size and i love this notebook", "benefits great laptop",
                       "wouldnt bad notebook", "very good quality", "bad orgtop but great",
                       "great improvement for that bad product but overall is not good", 
                       "notebook is not good but i love batterytop"), 
                   user = c(1,2,3,4,5,6,7),
                   stringsAsFactors=F)

wordsForRemoving <- c("great","improvement","love","great improvement","very good","good",
                      "right", "very","benefits", "extra","benefit","top","extraordinarily",
                      "extraordinary", "super","benefits super","good","benefits great",
                      "wouldnt bad")

Then I create a "big data" simulation to measure the time consumption...

df.expanded <- as.data.frame(replicate(1000000, sent$words))  # (this copy is not used below)
library(zoo)
sent <- coredata(sent)[rep(seq(nrow(sent)), 1000000), ]  # replicate the 7 sentences 1,000,000 times
rownames(sent) <- NULL

Using the following gsub approach to remove the words (wordsForRemoving) from sent$words takes 72.87 sec. I know this is not a good simulation, but in reality I'm using a word dictionary with more than 3,000 words on 300,000 sentences, and the overall processing takes over 1.5 hours.

pattern <- paste0("\\b(?:", paste(wordsForRemoving, collapse = "|"), ")\\b ?")
res <- gsub(pattern, "", sent$words)

#  user  system elapsed 
# 72.87    0.05   73.79

Please, could anyone help me write a faster approach for this task? Any help or advice is much appreciated. Thanks a lot in advance.

asked Mar 26 '15 by martinkabe


People also ask

Is GSUB slow?

gsub is not only slower, but it also requires extra effort from the reader to 'decode' the arguments.

What is the difference between sub and gsub in R?

The sub() and gsub() functions in R are used for substitution and replacement operations. The sub() function replaces only the first occurrence of a pattern, leaving the others as they are, while gsub() replaces all occurrences.
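A minimal illustration of that difference (hypothetical strings, not taken from the question):

x <- "bad product but bad design"
sub("bad", "poor", x)    # replaces only the first match: "poor product but bad design"
gsub("bad", "poor", x)   # replaces every match: "poor product but poor design"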

What is gsub() in R?

The gsub() function in R is used to replace all matches of a pattern in a string. If the pattern is not found, the string is returned as is. Syntax: gsub(pattern, replacement, string, ignore.case = TRUE/FALSE)
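A small sketch of both behaviours mentioned above (hypothetical strings):

gsub("xyz", "", "bad notebook")                       # pattern not found: "bad notebook" returned as is
gsub("BAD", "", "bad notebook", ignore.case = TRUE)   # case-insensitive match: " notebook"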


3 Answers

As mentioned by Jason, stringi is a good option for you.

Here is the performance of stringi compared with the original gsub call:

system.time(res <- gsub(pattern, "", sent$words))
   user  system elapsed 
 66.229   0.000  66.199 

library(stringi)
system.time(stri_replace_all_regex(sent$words, pattern, ""))
   user  system elapsed 
 21.246   0.320  21.552 

Update (Thanks Arun)

system.time(res <- gsub(pattern, "", sent$words, perl = TRUE))
   user  system elapsed 
 12.290   0.000  12.281 
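As a quick sanity check (a sketch using the question's pattern and data; not part of the original answer), the two approaches should produce identical results, so the speed-up comes for free:

library(stringi)
res_gsub    <- gsub(pattern, "", sent$words, perl = TRUE)
res_stringi <- stri_replace_all_regex(sent$words, pattern, "")
identical(res_gsub, res_stringi)   # expected TRUE: both engines handle \b and (?:...) the same way here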
answered Oct 25 '22 by vrajs5


This is not a real answer, as I didn't find any method that is always faster. Apparently it depends on the length of your text/vector: with short texts gsub performs fastest, while with longer texts or vectors sometimes gsub with perl = TRUE and sometimes stri_replace_all_regex is fastest.

Here is some test code to try out:

library(stringi)
text = "(a1,\"something (f fdd71)\");(b2,\"something else (a fa171)\");(b4,\"something else (a fa171)\")"
# text = paste(rep(text, 5), collapse = ",")
# text = rep(text, 100)
nchar(text)

a = gsub(pattern = "[()]", replacement = "", x = text)
b = gsub(pattern = "[()]", replacement = "", x = text, perl=T)
c = stri_replace_all_regex(str = text, pattern = "[()]", replacement = "")
d = stri_replace(str = text, regex = "[()]", replacement = "", mode="all")

identical(a,b); identical(a,c); identical(a,d)

library(microbenchmark)
mc <- microbenchmark(
  gsub = gsub(pattern = "[()]", replacement = "", x = text),
  gsub_perl = gsub(pattern = "[()]", replacement = "", x = text, perl=T),
  stringi_all = stri_replace_all_regex(str = text, pattern = "[()]", replacement = ""),
  stringi = stri_replace(str = text, regex = "[()]", replacement = "", mode="all")
)
mc
Unit: microseconds
        expr    min      lq     mean  median     uq     max neval  cld
        gsub 10.868 11.7740 13.47869 13.5840 14.490  31.394   100 a   
   gsub_perl 79.690 80.2945 82.58225 82.4070 83.312 137.043   100    d
 stringi_all 14.188 14.7920 15.58558 15.5460 16.301  17.509   100  b  
     stringi 36.828 38.0350 39.90904 38.7895 39.543 129.194   100   c
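To see how the ranking shifts with input size, a rough sketch (assuming the same text and pattern as above) that reruns the benchmark for growing inputs; uncommenting the rep()/paste() lines above achieves the same thing manually:

for (n in c(1, 100, 10000)) {
  txt <- rep(text, n)                       # grow the input vector
  print(microbenchmark(
    gsub        = gsub("[()]", "", txt),
    gsub_perl   = gsub("[()]", "", txt, perl = TRUE),
    stringi_all = stri_replace_all_regex(txt, "[()]", ""),
    times = 10
  ))
}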
answered Oct 25 '22 by SeGa


I built two tokenizer functions that differ in one respect: the first uses gsub, the second uses str_replace_all from the stringr package.
Here's function number one:

tokenize_gsub <- function(df){

    require(lexicon)
    require(dplyr)
    require(tidyr)
    require(tidytext)
    myStopWords <- c(
        "ø",
        "øthe",
        "iii"
    )

    profanity <- c(
        profanity_alvarez,
        profanity_arr_bad,
        profanity_banned,
        profanity_racist,
        profanity_zac_anger
    ) %>%
        unique()

    df %>%
        mutate(text = gsub(x = text, pattern = "[0-9]+|[[:punct:]]|\\(.*\\)", replacement = "")) %>%
        unnest_tokens(word, text) %>%
        anti_join(stop_words, by = "word") %>%
        anti_join(tibble(word = profanity), by = "word") %>%
        anti_join(tibble(word = myStopWords), by = "word")

}

Here's function number two:

tokenize_stringr <- function(df){

    require(stringr)
    require(lexicon)
    require(dplyr)
    require(tidyr)
    require(tidytext)

    myStopWords <- c(
        "ø",
        "øthe",
        "iii"
    )

    profanity <- c(
        profanity_alvarez,
        profanity_arr_bad,
        profanity_banned,
        profanity_racist,
        profanity_zac_anger
    ) %>%
        unique()

    df %>%
        mutate(text = str_replace_all(text, "[0-9]+|[[:punct:]]|\\(.*\\)", "")) %>%
        unnest_tokens(word, text) %>%
        anti_join(stop_words, by = "word") %>%
        anti_join(tibble(word = profanity), by = "word") %>%
        anti_join(tibble(word = myStopWords), by = "word")

}
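As a usage sketch (englishPosts is the answerer's dataset and is not shown; the tiny toy data frame below is a hypothetical stand-in with the required text column), both functions should return the same tokens:

library(dplyr)
toy <- tibble(text = c("Just the right size (2019 model), I love it!",
                       "Not a good battery, 3 hours at best."))
out_gsub    <- tokenize_gsub(toy)
out_stringr <- tokenize_stringr(toy)
identical(out_gsub, out_stringr)   # expected TRUE: both regexes strip the same characters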

Then I used a benchmarking function to compare the performance of the two with a dataset containing 4,269,678 social media posts (Twitter, blogs, etc.):

library(microbenchmark)
mc <- microbenchmark(
    gsubOption = tokenize_gsub(englishPosts),
    stringrOption = tokenize_stringr(englishPosts)
)

mc

Here's the output:

Unit: seconds
          expr      min       lq     mean   median       uq      max neval cld
    gsubOption 161.4945 175.3040 211.6979 197.5054 240.6451 376.2927   100   b
 stringrOption 101.4138 117.0748 142.9605 132.4253 159.6291 328.1517   100  a

CONCLUSION: The function str_replace_all is considerably faster than the gsub option under the conditions explained above.

answered Oct 25 '22 by Cucurucho