
Why does Rcpp return a list-like output when I was expecting a dataframe output in R?

Tags: c++, r, rcpp

I am trying to write a .cpp file that takes an input vector and outputs a two-column dataframe with all possible pairwise combinations of the input vector's elements. My output contains the desired values, but not as a dataframe. What do I change in the .cpp file to get a dataframe output?

My possible_combos.cpp file looks like this:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
GenericVector C_all_combos(GenericVector a) {
  int vec_length = a.size();
  int vec_length_sq = vec_length*vec_length; 
  GenericVector expand_vector_a(vec_length_sq);
  GenericVector expand_vector_b(vec_length_sq);  
  for (int i=0; i<vec_length_sq; i++) { expand_vector_a[i] = a[i / vec_length]; };
  for (int i=0; i<vec_length_sq; i++) { expand_vector_b[i] = a[i % vec_length]; };
  DataFrame my_df = DataFrame::create(Named("v_1") = expand_vector_a,
                                    Named("v_2") = expand_vector_b);
return my_df;
}

/*** R
C_all_combos(c(1, "Cars", 2.3))
  */

The desired output from running Rcpp::sourceCpp("possible_combos.cpp") is:

    v_1    v_2
    1       1
    1       Cars
    1       2.3
    Cars    1
    Cars    Cars
    Cars    2.3
    2.3     1
    2.3     Cars
    2.3     2.3

But what I get is:

    v_1..1. v_1..1..1 v_1..1..2 v_1..Cars. v_1..Cars..1 v_1..Cars..2 v_1..2.3. v_1..2.3..1 v_1..2.3..2
1       1         1         1       Cars         Cars         Cars       2.3         2.3         2.3
  v_2..1. v_2..Cars. v_2..2.3. v_2..1..1 v_2..Cars..1 v_2..2.3..1 v_2..1..2 v_2..Cars..2 v_2..2.3..2
1       1       Cars       2.3         1         Cars         2.3         1         Cars         2.3

Thanks for any tips! I'm familiar with excellent R functions like expand.grid(), but want to experiment with alternatives.

SEAnalyst asked Aug 20 '20 05:08


1 Answer

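The main issue is that Rcpp::GenericVector is a list, so the behavior is consistent with R: when data.frame() is given a list, each list element becomes its own column. Your input c(1, "Cars", 2.3) is a character vector in R, but taking it as a GenericVector converts it to a list. I show this below, together with a solution that uses a template function with one exported special case per input type.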

#include <Rcpp.h>
using namespace Rcpp;

// essentially your code
// [[Rcpp::export]]
DataFrame C_all_combos(GenericVector a) {
  size_t const vec_length = a.size(),
            vec_length_sq = vec_length * vec_length; 
  GenericVector expand_vector_a(vec_length_sq),
                expand_vector_b(vec_length_sq);  
  
  for (size_t i = 0; i < vec_length_sq; i++){ 
    expand_vector_a[i] = a[i / vec_length];
    expand_vector_b[i] = a[i % vec_length];
  }
  
  return DataFrame::create(_["v_1"] = expand_vector_a,
                           _["v_2"] = expand_vector_b, 
                           _["stringsAsFactors"] = false);
}

// template function used in the new solution
template<class T>
DataFrame C_all_combos_gen(T a) {
  size_t const vec_length = a.size(),
                  vec_length_sq = vec_length * vec_length; 
  T expand_vector_a(vec_length_sq),
    expand_vector_b(vec_length_sq);  
  
  for (size_t i = 0; i < vec_length_sq; i++){ 
    expand_vector_a[i] = a[i / vec_length];
    expand_vector_b[i] = a[i % vec_length];
  }
  
  return DataFrame::create(_["v_1"] = expand_vector_a,
                           _["v_2"] = expand_vector_b, 
                           _["stringsAsFactors"] = false);
}

// export particular versions
// [[Rcpp::export]]
DataFrame C_all_combos_int(IntegerVector a){
  return C_all_combos_gen<IntegerVector>(a);
}

// [[Rcpp::export]]
DataFrame C_all_combos_char(CharacterVector a){
  return C_all_combos_gen<CharacterVector>(a);
}

// [[Rcpp::export]]
DataFrame C_all_combos_num(NumericVector a){
  return C_all_combos_gen<NumericVector>(a);
}

// [[Rcpp::export]]
DataFrame C_all_combos_log(LogicalVector a){
  return C_all_combos_gen<LogicalVector>(a);
}

We can now run the following R code, which

  1. illustrates that the behavior of your code is consistent with R, and
  2. shows that the solution works.

######
# the issue with your code. Repeat your call
C_all_combos(c(1, "Cars", 2.3))
#R>   v_1..1. v_1..1..1 v_1..1..2 v_1..Cars. v_1..Cars..1 v_1..Cars..2 v_1..2.3. v_1..2.3..1 v_1..2.3..2 v_2..1. v_2..Cars. v_2..2.3. v_2..1..1 v_2..Cars..1 v_2..2.3..1 v_2..1..2
#R> 1       1         1         1       Cars         Cars         Cars       2.3         2.3         2.3       1       Cars       2.3         1         Cars         2.3         1
#R>   v_2..Cars..2 v_2..2.3..2
#R> 1         Cars         2.3

# amounts to doing the following in R, which yields the same output
# (expand.grid varies its first argument fastest, hence the swapped columns)
all_combs <- expand.grid(v_1 = c(1, "Cars", 2.3), v_2 = c(1, "Cars", 2.3), 
                         stringsAsFactors = FALSE)
data.frame(v_1 = as.list(all_combs$v_2), 
           v_2 = as.list(all_combs$v_1))
#R>   v_1..1. v_1..1..1 v_1..1..2 v_1..Cars. v_1..Cars..1 v_1..Cars..2 v_1..2.3. v_1..2.3..1 v_1..2.3..2 v_2..1. v_2..Cars. v_2..2.3. v_2..1..1 v_2..Cars..1 v_2..2.3..1 v_2..1..2
#R> 1       1         1         1       Cars         Cars         Cars       2.3         2.3         2.3       1       Cars       2.3         1         Cars         2.3         1
#R>   v_2..Cars..2 v_2..2.3..2
#R> 1         Cars         2.3

######
# here is a solution with the template function
C_all_combos_R <- function(a){
  if(is.logical(a))
    return(C_all_combos_log(a))
  else if(is.integer(a))
    return(C_all_combos_int(a))
  else if(is.numeric(a))
    return(C_all_combos_num(a))
  else if(is.character(a))
    return(C_all_combos_char(a))
  
  stop("C_all_combos_R not implemented")
}

# it works
C_all_combos_R(c(1, "Cars", 2.3))
#R>    v_1  v_2
#R> 1    1    1
#R> 2    1 Cars
#R> 3    1  2.3
#R> 4 Cars    1
#R> 5 Cars Cars
#R> 6 Cars  2.3
#R> 7  2.3    1
#R> 8  2.3 Cars
#R> 9  2.3  2.3
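
The pattern used above (one template for the loop, plus a thin dispatcher that picks the instantiation from the runtime type) does not depend on Rcpp. Below is a minimal sketch in plain C++17, not the Rcpp implementation: `AnyVector`, `Combos`, `AnyCombos`, `all_combos` and `all_combos_any` are hypothetical names invented for this illustration, and a `std::variant` of typed `std::vector`s stands in for an R vector that holds exactly one element type.

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for an R atomic vector: one element type per vector.
using AnyVector = std::variant<std::vector<int>,
                               std::vector<double>,
                               std::vector<std::string>>;

// Two typed columns of equal length, standing in for a two-column data frame.
template <class T>
struct Combos {
  std::vector<T> v_1, v_2;
};

using AnyCombos =
    std::variant<Combos<int>, Combos<double>, Combos<std::string>>;

// One template covers every element type; the columns keep that type.
template <class T>
Combos<T> all_combos(const std::vector<T>& a) {
  const std::size_t n = a.size();
  Combos<T> out;
  out.v_1.reserve(n * n);
  out.v_2.reserve(n * n);
  for (std::size_t jj = 0; jj < n; ++jj)
    for (std::size_t ii = 0; ii < n; ++ii) {
      out.v_1.push_back(a[jj]);  // slow-moving column
      out.v_2.push_back(a[ii]);  // fast-moving column
    }
  return out;
}

// Runtime dispatch: std::visit selects the instantiation that matches
// the type currently stored in the variant.
AnyCombos all_combos_any(const AnyVector& a) {
  return std::visit(
      [](const auto& v) -> AnyCombos { return all_combos(v); }, a);
}
```

Because each column is a typed vector rather than a container of arbitrary objects, there is no step at which the elements could be split into separate columns, which is the property the typed Rcpp exports rely on.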

Doing the type checking in C++ and more

You can also do all the type checking in C++, avoid the expensive integer division and modulus operations, and, as AEF does, avoid the DataFrame constructor, like this:

#include <Rcpp.h>
using namespace Rcpp;

template<int T>
SEXP C_all_combos_gen_two(Vector<T> a) {
  size_t const vec_length = a.size(),
            vec_length_sq = vec_length * vec_length; 
  Vector<T> expand_vector_a(vec_length_sq),
            expand_vector_b(vec_length_sq);  
  
  size_t i(0L);
  for(size_t jj = 0L; jj < vec_length; ++jj)
    for(size_t ii = 0L; ii < vec_length; ++i, ++ii){
      expand_vector_a[i] = a[jj];
      expand_vector_b[i] = a[ii];
    }
  
  List out = List::create(_["v_1"] = expand_vector_a,
                          _["v_2"] = expand_vector_b);
  
  out.attr("class") = "data.frame";
  out.attr("row.names") = Rcpp::seq(1, vec_length_sq);
  
  return out;
}

// [[Rcpp::export]]
SEXP C_all_combos_cpp(SEXP a){
  switch( TYPEOF(a) ){
  case INTSXP : return C_all_combos_gen_two<INTSXP>(a);
  case REALSXP: return C_all_combos_gen_two<REALSXP>(a);
  case STRSXP : return C_all_combos_gen_two<STRSXP>(a);
  case LGLSXP : return C_all_combos_gen_two<LGLSXP>(a);
  case VECSXP : return C_all_combos_gen_two<VECSXP>(a);
  default: Rcpp::stop("C_all_combos_cpp not implemented");
  }
  
  return DataFrame();
}

The new version yields

C_all_combos_cpp(c(1, "Cars", 2.3))
#R>    v_1  v_2
#R> 1    1    1
#R> 2    1 Cars
#R> 3    1  2.3
#R> 4 Cars    1
#R> 5 Cars Cars
#R> 6 Cars  2.3
#R> 7  2.3    1
#R> 8  2.3 Cars
#R> 9  2.3  2.3

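and it is fast compared with AEF's solution: in the benchmarks below, its median time is about 3 times lower for the 3-element input and about 4.7 times lower for 100 numeric values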

C_all_combos_cpp(c(1, "Cars", 2.3))

options(digits = 3)
library(bench)
mark(C_all_combos_cpp = C_all_combos_cpp(c(1, "Cars", 2.3)),
     AEF              = C_all_combos_aef(c(1, "Cars", 2.3)), check = FALSE)
#R> # A tibble: 2 x 13
#R>   expression            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time  
#R>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> 
#R> 1 C_all_combos_cpp   4.05µs   5.49µs   169097.    6.62KB     16.9  9999     1     59.1ms
#R> 2 AEF               15.76µs  16.96µs    57030.    2.49KB     45.7  9992     8    175.2ms

larger_num <- rnorm(100)
mark(C_all_combos_cpp = C_all_combos_cpp(larger_num),
     AEF              = C_all_combos_aef(larger_num), check = FALSE)
#R> # A tibble: 2 x 13
#R>   expression            min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#R>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#R> 1 C_all_combos_cpp   30.9µs   37.7µs    20817.     198KB     88.0  6862    29      330ms
#R> 2 AEF               167.9µs  178.4µs     5558.     199KB     21.5  2585    10      465ms
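
The loop restructuring in C_all_combos_gen_two can be checked in isolation from Rcpp. The following is a minimal sketch in plain C++, assuming that `std::vector<std::string>` stands in for the Rcpp vector; the names `Pairs`, `combos_divmod` and `combos_nested` are made up for this illustration. It shows that two nested loops visit the same (v_1, v_2) pairs, in the same order, as the `i / n` and `i % n` indexing.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Pairs = std::vector<std::pair<std::string, std::string>>;

// Division/modulus indexing, as in the question's code:
// one division and one modulus per output row.
Pairs combos_divmod(const std::vector<std::string>& a) {
  const std::size_t n = a.size();
  Pairs out;
  out.reserve(n * n);
  for (std::size_t i = 0; i < n * n; ++i)
    out.emplace_back(a[i / n], a[i % n]);
  return out;
}

// Nested loops: jj plays the role of i / n and ii the role of i % n,
// so no division or modulus is needed.
Pairs combos_nested(const std::vector<std::string>& a) {
  const std::size_t n = a.size();
  Pairs out;
  out.reserve(n * n);
  for (std::size_t jj = 0; jj < n; ++jj)
    for (std::size_t ii = 0; ii < n; ++ii)
      out.emplace_back(a[jj], a[ii]);
  return out;
}
```

For the input {"1", "Cars", "2.3"} both functions return the nine pairs from the desired output in the question, starting with ("1", "1"), ("1", "Cars"), ("1", "2.3").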

For completeness, here is the C++ code for C_all_combos_aef used in the benchmark:

// [[Rcpp::export]]
SEXP C_all_combos_aef(GenericVector a) {
  int vec_length = a.size();
  int vec_length_sq = vec_length * vec_length;
  GenericVector expand_vector_a(vec_length_sq);
  GenericVector expand_vector_b(vec_length_sq);
  for (int i=0; i<vec_length_sq; i++) { expand_vector_a[i] = a[i / vec_length]; };
  for (int i=0; i<vec_length_sq; i++) { expand_vector_b[i] = a[i % vec_length]; };
  
  List my_df = List::create(Named("v_1") = expand_vector_a,
                            Named("v_2") = expand_vector_b);
  
  
  my_df.attr("class") = "data.frame";
  my_df.attr("row.names") = Rcpp::seq(1, vec_length_sq);
  
  return my_df;
}
Benjamin Christoffersen answered Sep 30 '22 15:09