Slice a string at consecutive indices with R / Rcpp?

Tags: r, rcpp

I want to write a function that slices a 'string' into a vector, sequentially, at a given index. I have a fairly adequate R solution for it; however, I figure that writing the code in C/C++ would likely be faster. For example, I'd like to be able to write a function 'strslice' that operates as follows:

x <- "abcdef"
strslice( x, 2 ) ## should return c("ab", "cd", "ef")

However, I'm not sure how to handle treating elements of the 'CharacterVector' passed around in the Rcpp code as strings. This is what I imagine might work (given my lack of C++/Rcpp knowledge I'm sure there's a better approach):

f <- rcpp( signature(x="character", n="integer"), '
  std::string myString = Rcpp::as<std::string>(x);
  int cutpoint = Rcpp::as<int>(n);
  vector<std::string> outString;
  int len = myString.length();
  for( int i=0; i<len/n; i=i+n ) {
    outString.push_back( myString.substr(i,i+n-1 ) );
    myString = myString.substr(i+n, len-i*n);
  }
  return Rcpp::wrap<Rcpp::CharacterVector>( outString );
  ')

For the record, the corresponding R code I have is:

strslice <- function(x, n) {
  x <- as.data.frame( stringsAsFactors=FALSE, 
                      matrix( unlist( strsplit( x, "" ) ), ncol=n, byrow=TRUE )
  )

  do.call( function(...) { paste(..., sep="") }, x )

}

...but I figure jumping around between data structures so much will slow things down with very large strings.

(Alternatively: is there a way to coerce 'strsplit' into behaving as I want?)

Kevin Ushey asked Nov 10 '12 06:11


2 Answers

I would use substring. Something like this:

strslice <- function( x, n ){   
    starts <- seq( 1L, nchar(x), by = n )
    substring( x, starts, starts + n-1L )
}
strslice( "abcdef", 2 )
# [1] "ab" "cd" "ef"

As for your Rcpp code: you could allocate the std::vector<std::string> with the right size up front, which avoids repeated resizing (and the memory allocations that come with it), or you could fill an Rcpp::CharacterVector directly. Something like this:

library(inline)  # provides rcpp(); the Rcpp package must also be installed
strslice_rcpp <- rcpp( signature(x="character", n="integer"), '
    std::string myString = as<std::string>(x);
    int cutpoint = as<int>(n);
    int len = myString.length();
    int nout = len / cutpoint ;
    CharacterVector out( nout ) ;
    for( int i=0; i<nout; i++ ) {
      out[i] = myString.substr( cutpoint*i, cutpoint ) ;  // piece length is cutpoint, not a hard-coded 2
    }
    return out ;
')
strslice_rcpp( "abcdefg", 2 )
# [1] "ab" "cd" "ef"
# (the trailing "g" is dropped because len / cutpoint is integer division)
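
The Rcpp version above computes the number of pieces as len / cutpoint, so trailing characters that do not fill a whole piece are dropped, whereas the substring version keeps them. If the shorter last piece is wanted, the loop could look roughly like the plain C++ sketch below. The name slice_string is hypothetical and not part of Rcpp, and the code assumes single-byte characters, since std::string::substr counts bytes rather than characters:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Rcpp): split `s` into consecutive pieces of
// `n` bytes. The last piece is shorter when s.size() is not a multiple of n.
// Assumes single-byte characters; multi-byte UTF-8 text could be cut mid-character.
std::vector<std::string> slice_string(const std::string& s, std::size_t n) {
    std::vector<std::string> out;
    if (n == 0) {
        return out;  // nothing sensible to return for a zero-width piece
    }
    // Ceiling division gives the exact number of pieces, so reserve() lets the
    // vector allocate its storage once instead of growing repeatedly.
    out.reserve((s.size() + n - 1) / n);
    for (std::size_t start = 0; start < s.size(); start += n) {
        // substr() clamps the requested length at the end of the string.
        out.push_back(s.substr(start, n));
    }
    return out;
}
```

Under these assumptions, slice_string("abcdefg", 2) yields "ab", "cd", "ef", "g". Inside an Rcpp function the resulting vector could be returned with Rcpp::wrap, as in the question's code.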
Romain Francois answered Oct 18 '22 09:10


This one-liner using strapplyc from the gsubfn package is fast enough that Rcpp may not be needed. Here it is applied to the entire text of James Joyce's Ulysses, which takes only a few seconds:

library(gsubfn)
joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt") 
joycec <- paste(joyce, collapse = " ") # all in one string 
n <- 2
system.time(s <- strapplyc(joycec, paste(rep(".", n), collapse = ""))[[1]])
G. Grothendieck answered Oct 18 '22 11:10