Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improving performance of a loop with succeeding string replacements?

Tags:

r


I have (html-)texts and I want to change the ö things to real characters like ä, ü, ö, and so on because otherwise the xml-package does not accept it.

So I wrote a little function which cycles through a replacement table (link1, link2) and does replace special character by special character by sp... the function looks like this (only looonger):

html.charconv <- function(text){
    replacer <- matrix(c(
    "Á",    "&Aacute;",
    "á",    "&aacute;",
    "Â",    "&Acirc;",
    "â",    "&acirc;",
    "´",    "&acute;"
    )
    ,ncol=2,byrow=T)

    for(i in 1:length(replacer[,1])){
        text <- str_replace_all(text,replacer[i,2],replacer[i,1])
    }
    text
}

How might I speed this up? I thought about vectorization but did not come with any helping solution because for each cycle the result of the last cycle is its starting point.

like image 302
petermeissner Avatar asked Nov 30 '22 04:11

petermeissner


2 Answers

You can get a significant speedup by constructing your function a bit different, and forget about the text tools. Basically you :

  1. split the character string
  2. match the characters you want and replace them by the new characters
  3. paste everything together again

You can do that with following function :

html.fastconv <- function(x,old,new){
    xs <- strsplit(x,"&|;")
    old <- gsub("&|;","",old)
    xs <- lapply(xs,function(i){
        id <- match(i,old,0L)
        i[id!=0] <- new[id]
        return(i)
    })
    sapply(xs,paste,collapse="")
}

This works as :

> sometext <- c("&Aacute;dd som&aacute; le&Acirc;tter&acirc; acute problems et&acute; cetera",
+  "&Aacute;dd som&aacute; le&Acirc;tter&acirc; acute p ..." ... [TRUNCATED] 

> newchar <- c("Á","á","Â","â","´")

> oldchar <- c("&Aacute;","&aacute;","&Acirc;","&acirc;","&acute;")
> html.fastconv(sometext,oldchar,newchar)
[1] "Ádd somá leÂtterâ acute problems et´ cetera" "Ádd somá leÂtterâ acute problems et´ cetera"

For the record, some benchmarking :

require(rbenchmark)
benchmark(html.fastconv(sometext,oldchar,newchar),html.charconv(sometext),
     columns=c("test","elapsed","relative"),
     replications=1000) 
                                       test elapsed relative
2                   html.charconv(sometext)    0.79    5.643
1 html.fastconv(sometext, oldchar, newchar)    0.14    1.000
like image 140
Joris Meys Avatar answered Dec 05 '22 03:12

Joris Meys


Just for fun, here is a version based on Rcpp.

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
CharacterVector rcpp_conv( 
    CharacterVector text, CharacterVector old , CharacterVector new_){

    int n  = text.size() ;
    int nr = old.size() ;

    std::string buffer, current_old, current_new ;
    size_t pos, current_size ; 
    CharacterVector res(n) ;

    for( int i=0; i<n; i++){
        buffer = text[i] ;
        for( int j=0; j<nr; j++){
             current_old = old[j] ;
             current_size = current_old.size() ;
             current_new = new_[j] ;
             pos = 0 ;   
             pos = buffer.find( current_old ) ;
             while( pos != std::string::npos ){
                 buffer.replace( 
                     pos, current_size, 
                     current_new
                 ) ;
                 pos = buffer.find( current_old ) ;
             }
        }
        res[i] = buffer ;
    }
    return res ;
}

For which I get quite a further performance gain:

> microbenchmark(
+     html.fastconv( sometext,oldchar,newchar),
+     html.fastconvJC(sometext, oldchar, newchar),
+     rcpp_conv( sometext, oldchar, newchar)
+ )
Unit: microseconds
                                         expr    min      lq   median      uq
1   html.fastconv(sometext, oldchar, newchar) 97.588 99.9845 101.4195 103.072
2 html.fastconvJC(sometext, oldchar, newchar) 19.945 23.3060  25.8110  28.134
3       rcpp_conv(sometext, oldchar, newchar)  4.047  5.1555   6.2340   9.275
      max
1 256.061
2  40.647
3  25.763

Here is an implementation based on the Rcpp::String feature, available from Rcpp >= 0.10.2:

class StringConv{
public:
    typedef String result_type ;
    StringConv( CharacterVector old_, CharacterVector new__): 
        nr(old_.size()), old(old_), new_(new__){}

    String operator()(String text) const {
        for( int i=0; i<nr; i++){
            text.replace_all( old[i], new_[i] ) ;
        }     
        return text ;
    }

private:
    int nr ;
    CharacterVector old ;
    CharacterVector new_ ;
} ;

// [[Rcpp::export]]
CharacterVector test_sapply_string( 
   CharacterVector text, CharacterVector old , CharacterVector new_
){
   CharacterVector res = sapply( text, StringConv( old, new_ ) ) ;
   return res ;
}  
like image 32
Romain Francois Avatar answered Dec 05 '22 02:12

Romain Francois