Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently replace a fixed position substring with a string of equal or larger length

Tags:

r

rcpp

What would be an efficient way of replacing a fixed position substring with another string of equal or larger length?

For example, the following replaces the substring "abc" by finding the position of "abc" first and then replacing it:

sub("abc", "123", "iabc.def", fixed = TRUE)
#[1] "i123.def"

sub("abc", "1234", "iabc.def", fixed = TRUE)
#[1] "i1234.def"

However, we know that the substring "abc" is ALWAYS in character positions 2, 3 and 4. In this case, is there a way of specifying those positions so that the string matching doesn't need to be performed and the character indices used instead?

I did try to use substr() but it didn't work as I had hoped when the replacement string is larger than the substring being replaced:

x <- "iabc.def"
substr(x, 2, 4) <- "123"
#[1] "i123.def"

x <- "iabc.def"
substr(x, 2, 4) <- "1234"
#[1] "i123.def"

Many thanks in advance for your time,

Tony Breyal

P.S. The above may be the most efficient way of doing what I want but I thought I would ask just in case there is a better way :)

===== TIMINGS =====

#                             test elapsed  relative
# 7 francois.fx_wb(x, replacement)    0.94  1.000000
# 1                           f(x)    1.56  1.659574
# 6    francois.fx(x, replacement)    2.23  2.372340
# 5                      Sobala(x)    3.89  4.138298
# 2                    Hong.Ooi(x)    4.41  4.691489
# 3                        DWin(x)    5.57  5.925532
# 4                      hadley(x)    9.47 10.074468

The above timings were generated from the code below:

library(rbenchmark)
library(stringr)
library(Rcpp)
library(inline)

f <- function(x, replacement = "1234")  sub("abc", replacement, x, fixed = TRUE)

Hong.Ooi <- function(x, replacement = "1234") paste(substr(x, 1, 1), replacement, substr(x, 5, nchar(x)), sep = "")

DWin <- function(x, replacement =  paste("\\1", "1234", sep = "")) sub("^(.)abc", replacement, x)

Sobala <- function(x, replacement =  paste("\\1", "1234", sep = ""))  sub("^(.).{3}", replacement, x, perl=TRUE)

hadley <- function(x, replacement = "1234") {
  str_sub(x, 2, 4) <- replacement
  return(x)
}

francois.fx <- cxxfunction( signature( x_ = "character", rep_ = "character" ), '

    const char* rep =as<const char*>(rep_) ;
    CharacterVector x(x_) ;
    int nrep = strlen( rep ) ;
    int n = x.size() ; 
    CharacterVector res(n) ;

    char buffer[1024] ;

    for(int i=0; i<n; i++) {
        const char* xi = x[i] ;
        if( strncmp( xi+1, "abc", 3 ) ) {
            res[i] = x[i] ;
        } else{
            buffer[0] = xi[0] ;
            strncpy( buffer + 1, rep, nrep ) ;
            strcpy( buffer + 1 + nrep, xi + 4 ) ;
            res[i] = buffer ;
        }
    }
    return res ;
', plugin = "Rcpp" )

francois.fx_wb <- cxxfunction( signature( x_ = "character", rep_ = "character" ), '

    const char* rep =as<const char*>(rep_) ;
    int nrep = strlen( rep ) ;
    int n=Rf_length(x_) ;
    SEXP res = PROTECT( Rf_allocVector( STRSXP, n ) ) ;

    char buffer[1024] ;

    for(int i=0; i<n; i++) {
        const char* xi = char_get_string_elt(x_, i) ;
        if( strncmp( xi+1, "abc", 3 ) ) {
            set_string_elt( res, i, get_string_elt(x_,i)) ;
        } else{
            buffer[0] = xi[0] ;
            strncpy( buffer + 1, rep, nrep ) ;
            strcpy( buffer + 1 + nrep, xi + 4 ) ;
            char_set_string_elt(res, i, buffer ) ;
        }
    }
    UNPROTECT(1) ;
    return res ;
', plugin = "Rcpp" )


x <- rep("iabc.def", 1e6)
replacement <- "1234"
benchmark(f(x), Hong.Ooi(x), DWin(x), hadley(x), Sobala(x), francois.fx(x, replacement), francois.fx_wb(x, replacement),
          columns = c("test", "elapsed", "relative"),
          order = "relative",
          replications = 10)
like image 512
Tony Breyal Avatar asked Dec 10 '11 10:12

Tony Breyal


1 Answers

Here is one solution based on Rcpp.

fx <- cxxfunction( signature( x_ = "character", rep_ = "character" ), '

    const char* rep =as<const char*>(rep_) ;
    CharacterVector x(x_) ;
    int nrep = strlen( rep ) ;
    int n = x.size() ; 
    CharacterVector res(n) ;

    char buffer[1024] ;

    for(int i=0; i<n; i++) {
        const char* xi = x[i] ;
        if( strncmp( xi+1, "abc", 3 ) ) {
            res[i] = x[i] ;
        } else{
            buffer[0] = xi[0] ;
            strncpy( buffer + 1, rep, nrep ) ;
            strcpy( buffer + 1 + nrep, xi + 4 ) ;
            res[i] = buffer ;
        }
    }
    return res ;
', plugin = "Rcpp" )

it does not improve much on the simple sub solution because write access to strings in R are protected by the write barrier. I get better results if I cheat on the write barrier, but I'm not fully aware of the consequences, so I should probably advise against it :/

fx_wb <- cxxfunction( signature( x_ = "character", rep_ = "character" ), '

    const char* rep =as<const char*>(rep_) ;
    int nrep = strlen( rep ) ;
    int n=Rf_length(x_) ;
    SEXP res = PROTECT( Rf_allocVector( STRSXP, n ) ) ;

    char buffer[1024] ;

    for(int i=0; i<n; i++) {
        const char* xi = char_get_string_elt(x_, i) ;
        if( strncmp( xi+1, "abc", 3 ) ) {
            set_string_elt( res, i, get_string_elt(x_,i)) ;
        } else{
            buffer[0] = xi[0] ;
            strncpy( buffer + 1, rep, nrep ) ;
            strcpy( buffer + 1 + nrep, xi + 4 ) ;
            char_set_string_elt(res, i, buffer ) ;
        }
    }
    UNPROTECT(1) ;
    return res ;
', plugin = "Rcpp" )

Write barrier

The R Internals manual describes the write barrier:

A generational collector needs to efficiently ‘age’ the objects, especially list-like objects (including STRSXPs). This is done by ensuring that the elements of a list are regarded as at least as old as the list when they are assigned. This is handled by the functions SET_VECTOR_ELT and SET_STRING_ELT, which is why they are functions and not macros. Ensuring the integrity of such operations is termed the write barrier and is done by making the SEXP opaque and only providing access via functions (which cannot be used as lvalues in assignments in C).

All code in R extensions is by default behind the write barrier.

And Luke Tierney's document describes the logic behind why:

The generational collector divides allocated nodes into generations based on some notion of age. Younger generations are collected more frequently than older ones. For this to work correctly, any younger nodes that are reachable only from older nodes must be handled properly. This is accomplished by a write barrier that monitors each assignment and takes appropriate action when a reference to a new node is placed in an older one.

like image 84
Romain Francois Avatar answered Sep 27 '22 19:09

Romain Francois