I am trying to pass a character vector (i.e. vector of strings) from R to C/C++ for sorting and other purposes. When using Rcpp, this can easily be done with the following code:
#include <Rcpp.h>
#include <vector>
#include <string>
using namespace Rcpp;
// [[Rcpp::export]]
CharacterVector sort(CharacterVector x) {
std::sort(x.begin(), x.end());
return x;
}
However, as this is my only planned use of C++ in this package, it hardly seems worthwhile to introduce the dependency on Rcpp. Doing the same without it isn't so easy. Integers are easy enough:
#include <R.h>
#include <Rdefines.h>
#include <algorithm>
#include <string.h>
using namespace std;
SEXP sort(SEXP x) {
int* xx = INTEGER(x);
std::sort(xx, xx+LENGTH(x));
return(x);
}
But there isn't std::vector<string>
or char**
equivalent to INTEGER()
.
How can I emulate the same code without introducing the dependency to Rcpp?
There are questions on here that discuss how to use CHAR(STRING_ELT())
to convert a single string, but it is not clear how to convert into an array/vector of strings.
Convert elements of a Vector to Strings in R Language – toString() Function. toString() function in R Programming Language is used to produce a single character string describing an R object.
SEXP op : an “offset pointer”. This is used when multiple R functions use the same C function. For example do_logic() implements & , | , and ! . show_c_source() prints this out for you. SEXP args : a pairlist containing the unevaluated arguments to the function.
STRING_PTR()
is analogous to INTEGER()
. It returns a SEXP *
. Dereferencing the value is the SEXP of the first element of the R character vector; STRING_PTR(x)[0]
is the same as STRING_ELT(x, 0)
. The SEXP is itself a pointer to a data structure that contains a const char *
to the actual characters of the first element of the character vector; this pointer might be accessed by CHAR(STRING_PTR(x)[i])
. The various macros are defined in file.path(R.home("include"), "Rinternals.h")
The default std::sort(STRING_PTR(x), STRING_PTR(x) + LENGTH(x))
compares the de-referenced values of the arguments, i.e., compares the pointer addresses of the elements -- STRING_PTR(x)[i] < STRING_PTR(x)[j]
. What you want is to compare the actual null-terminated C strings, strcmp(CHAR(STRING_PTR(x)[i]), CHAR(STRING_PTR(x)[j]))
. Your original std::sort(x.begin(), x.end())
does not actually return sorted strings (I think the Rcpp way of doing the sort is x.sort()
).
You need a custom comparator that takes SEXP, extracts the const char *
(via the CHAR()
macro), and compares these. Here's the comparator
struct CMP_CHAR {
bool operator()(SEXP x, SEXP y) {
return strcmp(CHAR(x), CHAR(y)) < 0;
}
} cmp_char;
and the implementation
// [[Rcpp::export]]
SEXP sortcpp0(SEXP x) {
std::sort(STRING_PTR(x), STRING_PTR(x) + LENGTH(x), cmp_char);
return x;
}
(I use Rcpp in the example above to make it easy to continue to use sourceCpp()
, but this could be removed and the code compiled with R CMD SHLIB
or used in a package without depending on Rcpp).
Note though that this is a VERY BAD idea, because the direct manipulation of an object passed from R without copying breaks the illusion of copy-on-change and introduces action at a distance -- in the below, x
and y
are changed even though only x
is sorted.
> n = 10; set.seed(123); x = y = sample(as.character(1:n))
> sortcpp0(x); y
[1] "1" "10" "2" "3" "4" "5" "6" "7" "8" "9"
[1] "1" "10" "2" "3" "4" "5" "6" "7" "8" "9"
I believe that the naive Rcpp implementation
// [[Rcpp::export]]
CharacterVector sortRcpp2(CharacterVector x) {
return x.sort();
}
also suffers from this -- the input argument needs to be duplicated before sorting. The solution is simple -- duplicate (and PROTECT! even though not technically needed in the present example) the incoming argument.
// [[Rcpp::export]]
SEXP sortcpp(SEXP x) {
x = PROTECT(Rf_duplicate(x));
std::sort(STRING_PTR(x), STRING_PTR(x) + LENGTH(x), cmp_char);
UNPROTECT(x);
return x;
}
with the expected behavior
> n = 10; set.seed(123); x = y = sample(as.character(1:n))
> sortcpp(x); y
[1] "1" "10" "2" "3" "4" "5" "6" "7" "8" "9"
[1] "3" "8" "4" "7" "6" "1" "10" "9" "2" "5"
I think there is some trivial Rcpp incantation (Rcpp::clone()
, thanks @DirkEddelbuettel) for duplicating the argument, too, and this should be considered a required practice for any argument that is altered by the C++ code.
@Spacedman points out that R's sort is maybe ok -- it falls quickly to C, has a reasonably fast (though not fastest) sort implementation, and has lots of nice functionality built in (e.g., handling NA's, and different system locales; the latter influence sort order in very subtle ways). But there are still big gains to be had (I was surprised by this...)
> library(microbenchmark)
> n = 1e4; set.seed(123); x = sample(as.character(1:n))
> identical(sort(x), sortcpp(x))
[1] TRUE
> microbenchmark(sort(x), sortcpp(x))
Unit: milliseconds
expr min lq mean median uq max neval
sort(x) 56.061580 56.563541 57.034674 56.997618 57.667031 59.003068 100
sortcpp(x) 3.542409 3.556655 3.610071 3.582562 3.662196 3.800319 100
Finally, despite the comments on the post, and the closed original post, searching StackOverflow for [r] STRING_PTR returns exactly one hit -- this answer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With