I want to write a function that slices a 'string' into a vector, sequentially, at a given index. I have a fairly adequate R solution for it; however, I figure that writing the code in C/C++ would likely be faster. For example, I'd like to be able to write a function 'strslice' that operates as follows:
x <- "abcdef"
strslice( x, 2 ) ## should return c("ab", "cd", "ef")
However, I'm not sure how to handle treating elements of the 'CharacterVector' passed around in the Rcpp code as strings. This is what I imagine might work (given my lack of C++/Rcpp knowledge I'm sure there's a better approach):
f <- rcpp( signature(x="character", n="integer"), '
std::string myString = Rcpp::as<std::string>(x);
int cutpoint = Rcpp::as<int>(n);
std::vector<std::string> outString;
int len = myString.length();
// step through the string one chunk at a time; substr takes (position, length)
for( int i=0; i+cutpoint<=len; i=i+cutpoint ) {
outString.push_back( myString.substr(i, cutpoint) );
}
return Rcpp::wrap( outString );
')
For the record, the corresponding R code I have is:
strslice <- function(x, n) {
x <- as.data.frame( stringsAsFactors=FALSE,
matrix( unlist( strsplit( x, "" ) ), ncol=n, byrow=T )
)
do.call( function(...) { paste(..., sep="") }, x )
}
...but I figure jumping around between data structures so much will slow things down with very large strings.
(Alternatively: is there a way to coerce 'strsplit' into behaving as I want?)
I would use substring. Something like this:
strslice <- function( x, n ){
starts <- seq( 1L, nchar(x), by = n )
substring( x, starts, starts + n-1L )
}
strslice( "abcdef", 2 )
# [1] "ab" "cd" "ef"
About your Rcpp code: you could allocate the std::vector<std::string> with the right size up front, which avoids the repeated resizing (and the memory allocations that come with it), or you could use a Rcpp::CharacterVector directly. Something like this:
strslice_rcpp <- rcpp( signature(x="character", n="integer"), '
std::string myString = as<std::string>(x);
int cutpoint = as<int>(n);
int len = myString.length();
int nout = len / cutpoint ;
CharacterVector out( nout ) ;
for( int i=0; i<nout; i++ ) {
out[i] = myString.substr( cutpoint*i, cutpoint ) ;
}
return out ;
')
strslice_rcpp( "abcdefg", 2 )
# [1] "ab" "cd" "ef"
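To make the preallocation idea concrete outside of Rcpp, here is a minimal plain C++ sketch. The name slice_string and its keep_remainder flag are my own, hypothetical choices, not anything Rcpp provides. Note that the Rcpp snippet above computes the output length by integer division, so a trailing partial chunk is dropped, whereas the substring() version keeps it; the flag lets you pick either behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Rcpp): split `s` into consecutive
// chunks of `n` bytes. A trailing partial chunk is kept only when
// `keep_remainder` is true.
std::vector<std::string> slice_string(const std::string& s, std::size_t n,
                                      bool keep_remainder = false) {
    assert(n > 0 && "chunk width must be positive");

    const std::size_t full = s.size() / n;  // number of complete chunks
    const bool has_tail = keep_remainder && (s.size() % n != 0);

    std::vector<std::string> out;
    out.reserve(full + (has_tail ? 1 : 0));  // reserve the final size up front

    for (std::size_t i = 0; i < full; ++i) {
        out.push_back(s.substr(i * n, n));  // substr(position, length)
    }
    if (has_tail) {
        out.push_back(s.substr(full * n));  // whatever is left over
    }
    return out;
}
```

Calling slice_string("abcdefg", 2) gives the three full pairs, and passing true as the third argument also appends the leftover "g". This works on bytes, so multi-byte (e.g. UTF-8) characters would need extra care.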
This one-liner using strapplyc from the gsubfn package is fast enough that Rcpp may not be needed. Here we apply it to the entire text of James Joyce's Ulysses, which takes only a few seconds:
library(gsubfn)
joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joycec <- paste(joyce, collapse = " ") # all in one string
n <- 2
system.time(s <- strapplyc(joycec, paste(rep(".", n), collapse = ""))[[1]])
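For comparison, the same "match n arbitrary characters, over and over" idea can be sketched in C++ with std::regex. This is only an illustration under my own assumptions: slice_by_regex is a hypothetical name, and I make no claim that it is faster than the R version, since std::regex is often not especially quick.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect successive non-overlapping matches of
// exactly `n` characters, mirroring a pattern of n dots.
std::vector<std::string> slice_by_regex(const std::string& s, std::size_t n) {
    assert(n > 0 && "chunk width must be positive");

    // Builds e.g. ".{2}" for n == 2. In the default ECMAScript grammar
    // "." does not match line terminators such as '\n'.
    const std::regex chunk(".{" + std::to_string(n) + "}");

    std::vector<std::string> out;
    for (std::sregex_iterator it(s.begin(), s.end(), chunk), end; it != end; ++it) {
        out.push_back(it->str());  // text of the current match
    }
    return out;
}
```

As with the pattern built from n dots, a trailing run shorter than n characters produces no match and is left out of the result.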