
Passing by reference a data.frame and updating it with rcpp

Tags: r, rcpp

Looking at the Rcpp documentation and at Rcpp::DataFrame in the gallery, I realized that I didn't know how to modify a DataFrame by reference. Googling a bit, I found this post on SO and this post on the archive. Neither gives an obvious answer, so I suspect I am missing something big, like "it is already the case because" or "it does not make sense because".

I tried the following, which compiled, but the data.frame object passed to updateDFByRef in R stayed untouched:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
void updateDFByRef(DataFrame& df) {
    int N = df.nrows();
    NumericVector newCol(N,1.);
    df["newCol"] = newCol;
    return;
}
statquant asked Mar 31 '13 15:03


2 Answers

The way DataFrame::operator[] is implemented does indeed lead to a copy when you do this:

df["newCol"] = newCol;

To do what you want, you need to consider what a data frame is: a list of vectors with certain attributes. You can then build a new list that reuses the data from the original by copying the vectors (the pointers, not their content).

Something like this does it. It is a little more work, but not that hard.

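#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List updateDFByRef(DataFrame& df, std::string name) {
    int nr = df.nrows(), nc = df.size();
    NumericVector newCol(nr, 1.);  // new column, filled with 1
    List out(nc + 1);              // room for the old columns plus one
    CharacterVector onames = df.attr("names");
    CharacterVector names(nc + 1);
    // Reuse the existing columns: only the pointers are copied, not the data.
    for (int i = 0; i < nc; i++) {
        out[i] = df[i];
        names[i] = onames[i];
    }
    out[nc] = newCol;
    names[nc] = name;
    // Carry over the attributes that make the list a data frame.
    out.attr("class") = df.attr("class");
    out.attr("row.names") = df.attr("row.names");
    out.attr("names") = names;
    return out;
}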

There are issues associated with this approach. Your original data frame and the one you created share the same vectors, so modifying a column of one in place from C++ also changes the other. Only use this if you know what you are doing.
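The sharing hazard can be illustrated without R at all. The following is a minimal sketch in plain C++, not Rcpp code: it models a data frame as a list of named shared pointers to columns. The names `Column`, `Frame`, `addColumn`, and `writeThroughCopyReachesOriginal` are hypothetical and exist only for this illustration. Adding a column builds a new list, but the old columns are shared, so an in-place write through the new frame shows up in the original.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not Rcpp types.
using Column = std::shared_ptr<std::vector<double>>;
using Frame  = std::vector<std::pair<std::string, Column>>;

// Build a new frame holding the same column pointers plus one new column.
// The column data is not copied, only the pointers to it.
Frame addColumn(const Frame& df, const std::string& name, std::size_t nrow) {
    Frame out = df;  // copies the (name, pointer) pairs, not the vectors
    out.emplace_back(name, std::make_shared<std::vector<double>>(nrow, 1.0));
    return out;
}

// Returns true when a write made through the new frame is visible
// through the original frame.
bool writeThroughCopyReachesOriginal() {
    Frame df;
    df.emplace_back("x", std::make_shared<std::vector<double>>(3, 0.0));

    Frame df2 = addColumn(df, "newCol", 3);

    // Both frames point at the same underlying vector for column "x".
    (*df2[0].second)[0] = 42.0;
    return (*df[0].second)[0] == 42.0;
}
```

In this model the original frame still has one column after `addColumn`, while the returned frame has two, and the two frames alias the same storage for `x`.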

Romain Francois answered Oct 16 '22 17:10


The short answer is "because it makes no sense".

A data.frame is essentially a list of vectors. A few seconds of reflection make it clear that adding a new column to that list entails a copy. So you alter your variable df in the example, do not return it, and hence lose the modification.
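Why the modification is lost can be sketched in plain C++ as well. This is a simplified model, not Rcpp's actual implementation: it assumes the function receives a handle that points at the caller's list, and that growing the list means building a longer one and rebinding the handle. All names here (`Handle`, `addColumn`, `callWithFreshHandle`, `makeOneColumnList`) are hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical model for illustration; none of these are Rcpp types.
using Column  = std::shared_ptr<std::vector<double>>;
using Columns = std::vector<Column>;

// A handle that points at a list of columns, loosely like a wrapper
// object around an R list.
struct Handle {
    std::shared_ptr<const Columns> list;
};

// "Adds" a column: builds a longer list and rebinds the handle to it.
// The list the handle pointed at before is left as it was.
void addColumn(Handle& h, std::size_t nrow) {
    auto grown = std::make_shared<Columns>(*h.list);
    grown->push_back(std::make_shared<std::vector<double>>(nrow, 1.0));
    h.list = grown;
}

// Plays the role of the glue between the caller and the function: it
// wraps the caller's list in a fresh handle, calls the function, and
// discards the handle. Returns the column count the function ended with.
std::size_t callWithFreshHandle(const std::shared_ptr<const Columns>& callerList) {
    Handle h{callerList};
    addColumn(h, 3);
    return h.list->size();
}

// Helper that creates a list with a single column of three zeros.
std::shared_ptr<const Columns> makeOneColumnList() {
    auto cols = std::make_shared<Columns>();
    cols->push_back(std::make_shared<std::vector<double>>(3, 0.0));
    return cols;
}
```

Inside the call the handle ends up pointing at a two-column list, but the caller's list still has one column. Under this model the only way to hand the longer list back is to return it.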

Merely wishing for something to work a certain way is not always enough.

Dirk Eddelbuettel answered Oct 16 '22 19:10