How to handle C++ internal data structure in R in order to allow save/load?

Tags: c++, c, r

In R, I am using an internal C++ structure to store the data (through an external pointer created with R_MakeExternalPtr). The data can then be processed by various functions. Here is a simple example (the real data structure is much more complex):

#include <Rinternals.h>
class MyObject{
        public:
        int type;
        MyObject(int t):type(t){}
};


void finalizeMyObject(SEXP ptr){
    MyObject * mo= static_cast<MyObject *>(R_ExternalPtrAddr(ptr));
    delete mo;
}

extern "C" {

    SEXP CreateObject(SEXP type) {
        SEXP ans;
        int nb = length(type);
        PROTECT(ans=allocVector(VECSXP, nb));
        int * typeP = INTEGER(type);
        SEXP cl;
        PROTECT(cl = allocVector(STRSXP, 1));
        SET_STRING_ELT(cl, 0, mkChar("myObject"));
        for(int i=0; i<nb; i++){
            MyObject * mo = new MyObject(typeP[i]);
            SEXP tmpObj = R_MakeExternalPtr(mo, R_NilValue, R_NilValue);
            R_RegisterCFinalizerEx(tmpObj, (R_CFinalizer_t) finalizeMyObject, TRUE);

            classgets(tmpObj, cl);
            SET_VECTOR_ELT(ans, i, tmpObj);//Put in vector
        }
        UNPROTECT(2);
        return ans;
    }

    SEXP printMyObject(SEXP myObjects){
        int nb = length(myObjects);
        for(int i=0; i<nb; i++){
            SEXP moS=VECTOR_ELT(myObjects, i);
            MyObject * mo= static_cast<MyObject *>(R_ExternalPtrAddr(moS));
            Rprintf("%d\n", mo->type);
        }
        return R_NilValue;
    }
}

Once the internal data structure is built, several functions can be called to compute different statistics. The function printMyObject above is an example of such a function.

The problem arises when trying to save this internal structure. It seems to save the pointer addresses. When the objects are reloaded, we get a segfault. Here is some example R code (suppose that myobject.cpp contains the code above and was compiled to myobject.dll):

dyn.load("myobject.dll")
## Create the internal data structure
xx <- .Call("CreateObject", 1:10)
##Everything is fine
.Call("printMyObject", xx)
## Save it, no errors
save(xx, file="xx.RData")
## remove all objects
rm(list=ls())
## Load the data
load(file="xx.RData")
##segfault
.Call("printMyObject", xx)

My question is: what is the best way to handle this correctly? I thought about several strategies, but except for the first one, I do not know how they can be implemented (or whether they are possible at all):

  • Do not return the internal object. Users always hold R structures that are converted internally (for each function call, such as print and so on) to the C++ structure. Pros: data can be saved without a segfault. Cons: this is very inefficient, especially when there is a lot of data.
  • Store the object as both an R and a C++ structure and try to figure out (I do not know how) whether the C++ structure is corrupted, in order to rebuild it. Pros: the internal data structure is built only once per session. Cons: not ideal for memory usage (the data is stored twice).
  • Use only the C++ structure and never save/load (the solution currently used). The problem is that there is no error message, just a segfault.
  • Find a way to serialize the internal C++ object when R save/load functions are used.
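
As a hedged aside on the third strategy: the silent segfault can at least be turned into an error by checking the pointer before dereferencing it. The sketch below is plain C++ without the R headers; `requireObject` is a hypothetical helper name and this `MyObject` is only a stand-in for the class above. It assumes the stale pointer shows up as a null address after reloading. In package code, the exception would more likely be replaced by R's own error mechanism (for example a call to `Rf_error`).

```cpp
#include <stdexcept>

// Stand-in for the MyObject class from the question.
struct MyObject {
    int type;
    explicit MyObject(int t) : type(t) {}
};

// Hypothetical helper: return the object behind a pointer, or fail loudly
// instead of dereferencing a null address.
inline MyObject& requireObject(MyObject* p) {
    if (p == nullptr) {
        throw std::runtime_error(
            "C++ object is gone (was the R object saved and reloaded?)");
    }
    return *p;
}
```

Every function that receives the external pointer would go through such a check first, so a reloaded object yields a message rather than a crash.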

Any ideas or suggestions are very welcome.

asked Aug 27 '13 19:08 by Matthias Studer


1 Answer

In the end, I arrived at two solutions for handling the R save/load mechanism correctly with C++ objects (data persistence). Neither is perfect, but both seem better than doing nothing. In short, the solutions are (details below):

  • Use the R raw data type to store the C++ objects. This has the advantage of being memory efficient, but it is quite complex in practice. Personally, I would not use it for a big project, because memory management can be tedious. There is surely room for improvement, perhaps by using template classes (please share if you have an idea).
  • Store the object as both an R and a C++ structure, and rebuild the C++ objects after a reload. The C++ objects are built only once (not for each function call), but the data is stored twice. This has the serious advantage of being much easier to implement on a large project.

To present those solutions, I use a slightly more complicated example in order to highlight the complexity of the first solution. Here, the example C++ object is a kind of linked list.

First solution: Use R RAW data

The idea (proposed by @KarlForner) is to use the R Raw data type to store the content of the object. In practice this means:

  • Allocate memory as an R raw vector.
  • Be sure to allocate the correct amount of memory (this may be complicated).
  • Use placement new to create the C++ object at the specified memory address (the one holding the raw data).
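
Stripped of the R API, the three steps can be sketched as follows. This is only an illustration under stated assumptions: a `std::vector<unsigned char>` stands in for the R raw vector, and `Node`, `buildChain` and `sumChain` are hypothetical names. The point is that an object without absolute pointers survives a plain byte copy, which is roughly what save/load does to a raw vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <new>
#include <type_traits>
#include <vector>

// Pointer-free node: the next node, if any, is the next slot in the buffer.
struct Node {
    int type;
    bool hasnext;
};
static_assert(std::is_trivially_copyable<Node>::value,
              "Node must be safe to copy byte by byte");

// Allocate exactly t slots and construct the chain t, t-1, ..., 1 in place.
std::vector<unsigned char> buildChain(int t) {
    assert(t >= 1);
    const std::size_t n = static_cast<std::size_t>(t);
    std::vector<unsigned char> buf(n * sizeof(Node));
    for (std::size_t i = 0; i < n; ++i) {
        // Placement new: construct a Node at a given address in the buffer.
        new (buf.data() + i * sizeof(Node)) Node{t - static_cast<int>(i), i + 1 < n};
    }
    return buf;
}

// Walk the chain stored in a buffer and add up the 'type' fields.
int sumChain(const std::vector<unsigned char>& buf) {
    int sum = 0;
    std::size_t offset = 0;
    Node node;
    do {
        assert(offset + sizeof(Node) <= buf.size());
        std::memcpy(&node, buf.data() + offset, sizeof(Node));
        sum += node.type;
        offset += sizeof(Node);
    } while (node.hasnext);
    return sum;
}
```

Copying the buffer returned by `buildChain` and passing the copy to `sumChain` gives the same result as using the original, because nothing in the buffer refers to an absolute address.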

Here is the file “myobject.cpp”:

#include <Rinternals.h>
#include <new>
//Our example Object: A linked list.
class MyRawObject{
    int type;
    // Next object in the list.
    bool hasnext;
    public:
        MyRawObject(int t):type(t), hasnext (false){
            if(t>1){
                //We build a next object.
                hasnext = true;
                //Here we use placement new to build the object in the next allocated memory.
                //No need to store a pointer.
                //We know that this is in the next contiguous location (see memory allocation below)
                new (this+1) MyRawObject(t-1);
            }
        }

        void print(){
            Rprintf(" => %d ", type);
            if(this->hasnext){ 
                //Next object located is in the next contiguous memory block
                (this+1)->print(); //Next in allocated memory
            }
        }
};
extern "C" {
    SEXP CreateRawObject(SEXP type) {
        SEXP ans;
        int nb = length(type);
        PROTECT(ans=allocVector(VECSXP, nb));
        int * typeP = INTEGER(type);
        //When allocating memory, we need to know the size of our object.
        int MOsize =sizeof(MyRawObject);
        for(int i=0; i<nb; i++){
            SEXP rawData;
            //Allocate the memory using R RAW data type.
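            //Take care to allocate the right amount of memory
            //Here we have typeP[i] objects to create.
            //At least one object is always constructed, so the buffer must hold at least one.
            if(typeP[i] < 1){
                error("each 'type' value must be at least 1");
            }
            PROTECT(rawData=allocVector(RAWSXP, typeP[i]*MOsize));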
            //Memory address of the allocated memory
            unsigned char * buf = RAW(rawData);
            //Here we use placement new to build the object in the allocated memory
            new (buf) MyRawObject(typeP[i]);
            SET_VECTOR_ELT(ans, i, rawData);//Put in vector
            UNPROTECT(1);
        }
        UNPROTECT(1);
        return ans;
    }
    SEXP printRawObject(SEXP myObjects){
        int nb = length(myObjects);
        for(int i=0; i<nb; i++){
            SEXP moS=VECTOR_ELT(myObjects, i);
            //Use reinterpret_cast to have a pointer of type MyRawObject pointing to the RAW data
            MyRawObject * mo= reinterpret_cast<MyRawObject *>(RAW(moS));
            if(mo!=NULL){
                Rprintf("Address: %p", (void *) mo); //%p is the format for pointers
                mo->print();
                Rprintf("\n");
            }else{
                Rprintf("Null pointer!\n");
            }
        }
        return R_NilValue;
    }
}

The functions can be used directly in R, for instance like this:

## Create the internal data structure
xx <- .Call("CreateRawObject", 1:10)
##Print
.Call("printRawObject", xx)
## Save it
save(xx, file="xxRaw.RData")
## remove all objects
rm(xx)
## Load the data
load(file="xxRaw.RData")
##Works !
.Call("printRawObject", xx)

There are several issues with this solution:

  • Destructors are not called (depending on what your destructor does, this may be an issue).
  • If your class contains pointers to other objects, all pointers must be relative to the memory location in order to work after reloading (as here with the linked list).
  • It is quite complex.
  • I am not sure that the saved file (.RData) will be cross-platform (would it be safe to share an RData file across computers?), because I am not sure that the memory layout used to store an object is the same on all platforms.
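
On the cross-platform worry, one possible way around it (not what the code above does) is to write each field explicitly with a fixed width and byte order, instead of storing the in-memory image of the object. A minimal sketch, where `putInt32` and `getInt32` are hypothetical helper names and little-endian order is an arbitrary choice:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append a 32-bit integer to the buffer, least significant byte first.
void putInt32(std::vector<unsigned char>& out, std::int32_t value) {
    const std::uint32_t u = static_cast<std::uint32_t>(value);
    for (int k = 0; k < 4; ++k) {
        out.push_back(static_cast<unsigned char>((u >> (8 * k)) & 0xFFu));
    }
}

// Read back a 32-bit integer written by putInt32, starting at byte 'pos'.
std::int32_t getInt32(const std::vector<unsigned char>& in, std::size_t pos) {
    assert(pos + 4 <= in.size());
    std::uint32_t u = 0;
    for (int k = 0; k < 4; ++k) {
        u |= static_cast<std::uint32_t>(in[pos + k]) << (8 * k);
    }
    // Assumes two's complement integers, which holds on mainstream platforms.
    return static_cast<std::int32_t>(u);
}
```

The encoded bytes could then be copied into a raw vector, and the C++ object rebuilt from them after loading. The cost is that the object is no longer usable in place: it has to be decoded first.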

Second solution: Store the object as both an R and a C++ structure and rebuild the C++ objects after reload

The idea is to check, in each function call, whether the C++ pointers are valid. If they are not, the objects are rebuilt; otherwise the step is skipped. This is possible because we directly modify the R object (in C++), so the change is effective for all subsequent calls. The C++ objects are built only once per session (not for each function call), but the data is stored twice. This approach has the serious advantage of being much easier to implement on a large project.
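
Independently of the R API, the check-and-rebuild logic boils down to something like the sketch below. The names `Chain`, `Handle` and `ensureBuilt` are hypothetical: `type` plays the role of the data kept on the R side, and `cache` the role of the external pointer, which is assumed to come back empty after a reload.

```cpp
#include <memory>

// Stand-in for the expensive C++ structure.
struct Chain {
    int length;
    explicit Chain(int t) : length(t) {}
};

struct Handle {
    int type;                      // plain data: survives save/load
    std::unique_ptr<Chain> cache;  // native object: lost on reload
};

// Rebuild the native object only when it is missing.
// Returns true if a rebuild was needed, false if the cache was still valid.
bool ensureBuilt(Handle& h) {
    if (h.cache) {
        return false;
    }
    h.cache.reset(new Chain(h.type));
    return true;
}
```

Calling `ensureBuilt` at the start of every function keeps the cost to a single pointer test in the common case, while the first call after a reload pays for the rebuild.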

Here is the file “myobject.cpp”:

#include <Rinternals.h>
//Our object can be made simpler because we can use pointers
class MyObject{
    int type;
    //Pointer to the next object
    MyObject *next;
    public:
        MyObject(int t):type(t), next(NULL){
            if(t>1){
                next = new MyObject(t-1);
            }
        }
        ~MyObject(){
            if(this->next!=NULL){
                delete next;
            }
        }
        void print(){
            Rprintf(" => %d ", type);
            if(this->next!=NULL){
                this->next->print();
            }
        }
};
void finalizeMyObject(SEXP ptr){
    MyObject * mo= static_cast<MyObject *>(R_ExternalPtrAddr(ptr));
    delete mo;
}

extern "C" {

    SEXP CreateObject(SEXP type) {
        SEXP ans;
        int nb = length(type);
        PROTECT(ans=allocVector(VECSXP, nb));
        int * typeP = INTEGER(type);
        SEXP cl;
        PROTECT(cl = allocVector(STRSXP, 1));
        SET_STRING_ELT(cl, 0, mkChar("myObject"));
        for(int i=0; i<nb; i++){
            MyObject * mo = new MyObject(typeP[i]);
            SEXP tmpObj = R_MakeExternalPtr(mo, R_NilValue, R_NilValue);
            R_RegisterCFinalizerEx(tmpObj, (R_CFinalizer_t) finalizeMyObject, TRUE);
            
            classgets(tmpObj, cl);
            SET_VECTOR_ELT(ans, i, tmpObj);//Put in vector
        }
        UNPROTECT(2);
        return ans;
    }
    
    SEXP printMyObject(SEXP myObjects){
        int nb = length(myObjects);
        for(int i=0; i<nb; i++){
            SEXP moS=VECTOR_ELT(myObjects, i);
            MyObject * mo= static_cast<MyObject *>(R_ExternalPtrAddr(moS));
            if(mo!=NULL){
                Rprintf("Address: %p", (void *) mo); //%p is the format for pointers
                mo->print();
                Rprintf("\n");
            }else{
                Rprintf("Null pointer!\n");
            }
        }
        return R_NilValue;
    }
    //This function checks whether the C++ pointers are NULL (as they are after a reload); if so, it rebuilds all the C++ objects.
    SEXP checkRebuildMyObject(SEXP myObjects, SEXP type){
        int nb = length(myObjects);
        if(nb==0){
            return R_NilValue;
        }
        //Test the first element (index 0; index 1 would be out of bounds when nb==1)
        if(R_ExternalPtrAddr(VECTOR_ELT(myObjects, 0))){ //Non corrupted ptrs
            return R_NilValue;
        }
        int * typeP = INTEGER(type);
        SEXP cl;
        PROTECT(cl = allocVector(STRSXP, 1));
        SET_STRING_ELT(cl, 0, mkChar("myObject"));
        for(int i=0; i<nb; i++){
            MyObject * mo = new MyObject(typeP[i]);
            SEXP tmpObj = R_MakeExternalPtr(mo, R_NilValue, R_NilValue);
            R_RegisterCFinalizerEx(tmpObj, (R_CFinalizer_t) finalizeMyObject, TRUE);
            classgets(tmpObj, cl);
            SET_VECTOR_ELT(myObjects, i, tmpObj);//Put in vector
        }
        UNPROTECT(1);
        return R_NilValue;
    }
    
}

On the R side, we can use those functions as follows:

dyn.load("myobject.dll")

CreateObjectR <- function(type){
    ## We use a custom object type that stores both the C++ and the R data
    type <- as.integer(type)
    mo <- list(type=type, myObject= .Call("CreateObject", type))
    class(mo) <- "myObjectList"
    return(mo)
}

print.myObjectList <- function(x, ...){
    ## Be sure to check the C++ objects before calling the functions.
    .Call("checkRebuildMyObject", x$myObject, x$type)
    .Call("printMyObject", x$myObject)
    ## By convention, print methods return their argument invisibly
    invisible(x)
}

xx <- CreateObjectR(1:10)
print(xx)
save(xx, file="xx.RData")
rm(xx)
load(file="xx.RData")
print(xx)

answered Sep 28 '22 16:09 by Matthias Studer