
Creating an R dataframe row-by-row

Tags:

list

dataframe

r

You can grow them row by row by appending or using rbind().

That does not mean you should. Dynamically growing structures is one of the least efficient ways to code in R.

If you can, allocate your entire data.frame up front:

N <- 1e4  # total number of rows to preallocate--possibly an overestimate

DF <- data.frame(num=rep(NA_real_, N), txt=rep("", N),  # as many cols as you need
                 stringsAsFactors=FALSE)          # you don't know levels yet

and then, during your operations, insert one row at a time:

DF[i, ] <- list(1.4, "foo")

That should work for an arbitrary data.frame and be much more efficient. If you overshoot N, you can always drop the unused rows at the end.
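To make the whole workflow concrete, here is a minimal sketch of preallocating, filling, and trimming. The counter n_used and the generated values are made up for illustration; only the preallocation and the DF[i, ] <- list(...) pattern come from the answer above.

```r
N <- 10                                  # deliberately overestimated capacity
DF <- data.frame(num = rep(NA_real_, N), txt = rep("", N),
                 stringsAsFactors = FALSE)

n_used <- 0                              # hypothetical counter of filled rows
for (i in 1:4) {
  n_used <- n_used + 1
  DF[n_used, ] <- list(i * 1.5, paste0("row", i))
}

DF <- DF[seq_len(n_used), ]              # drop the rows that were never filled
print(DF)
```

Keeping an explicit counter is one way to know where to cut; filtering on the placeholder values would also work if they cannot occur in real data.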


One can add rows to NULL:

df <- NULL
while(...){
  # Some code that generates a new row
  df <- rbind(df, row)
}

For instance:

df <- NULL
for (e in 1:10) {
  df <- rbind(df, data.frame(x = e, square = e^2, even = factor(e %% 2 == 0)))
}
print(df)
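One detail of that loop is worth checking: each one-row data.frame carries a factor with a single level, and rbind() widens the level set as new values arrive. A shortened sketch of the same loop (4 rows instead of 10, an arbitrary choice) that inspects the result:

```r
df <- NULL
for (e in 1:4) {
  # rbind(NULL, <data.frame>) simply returns the data.frame on the first pass
  df <- rbind(df, data.frame(x = e, square = e^2, even = factor(e %% 2 == 0)))
}

str(df)            # column types after binding
levels(df$even)    # both "FALSE" and "TRUE" are present
```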

This is a silly example of how to use do.call(rbind, ...) on the output of Map() (which is similar to lapply()):

> DF <- do.call(rbind,Map(function(x) data.frame(a=x,b=x+1),x=1:3))
> DF
  a b
1 1 2
2 2 3
3 3 4
> class(DF)
[1] "data.frame"

I use this construct quite often.
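The same idea also works when the rows come out of an ordinary loop: collect the pieces in a list and bind them once at the end. A minimal sketch, where make_row() is a hypothetical stand-in for whatever produces each row:

```r
make_row <- function(x) data.frame(a = x, b = x + 1)   # hypothetical row generator

n <- 3
rows <- vector("list", n)        # preallocated list, one slot per row
for (i in seq_len(n)) {
  rows[[i]] <- make_row(i)
}

DF <- do.call(rbind, rows)       # a single rbind() call over all pieces
print(DF)
```

Filling list slots avoids rebuilding the data.frame on every iteration, which is the part that makes repeated rbind() slow.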


The reason I like Rcpp so much is that I don't always get how R Core thinks, and with Rcpp, more often than not, I don't have to.

Speaking philosophically, you're in a state of sin with regard to the functional paradigm, which tries to ensure that every value appears independent of every other value; changing one value should never cause a visible change in another value, the way it can with pointers sharing representation in C.

The problems arise when functional programming signals the small craft to move out of the way, and the small craft replies "I'm a lighthouse". Making a long series of small changes to a large object that you also want to keep processing in the meantime puts you squarely in lighthouse territory.

In the C++ STL, push_back() is a way of life. It doesn't try to be functional, but it does try to accommodate common programming idioms efficiently.

With some cleverness behind the scenes, you can sometimes arrange to have one foot in each world. Snapshot-based file systems are a good example (they evolved from concepts such as union mounts, which also ply both sides).

If R Core wanted to do this, underlying vector storage could function like a union mount. One reference to the vector storage might be valid for subscripts 1:N, while another reference to the same storage is valid for subscripts 1:(N+1). There could be reserved storage not yet validly referenced by anything but convenient for a quick push_back(). You don't violate the functional concept when appending outside the range that any existing reference considers valid.

Eventually, as you append rows incrementally, you run out of reserved storage. You'll need to create new copies of everything, with the storage multiplied by some factor. The STL implementations I've used tend to multiply storage by 2 when extending the allocation. I thought I read in R Internals that there is a memory structure where the storage grows by 20%. Either way, growth operations occur with logarithmic frequency relative to the total number of elements appended. On an amortized basis, this is usually acceptable.
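The logarithmic growth count is easy to check with a toy model. The sketch below is only bookkeeping written in R, not a description of how R manages vector storage internally; new_buffer(), push_back(), and the starting capacity of 4 are all invented for illustration.

```r
# Hypothetical growable buffer: a list holding storage, a fill count,
# and a counter of how often the storage had to be enlarged.
new_buffer <- function(capacity = 4L) {
  list(data = rep(NA_real_, capacity), size = 0L, grows = 0L)
}

push_back <- function(buf, value) {
  if (buf$size == length(buf$data)) {
    # Out of reserved storage: double the capacity.
    buf$data <- c(buf$data, rep(NA_real_, length(buf$data)))
    buf$grows <- buf$grows + 1L
  }
  buf$size <- buf$size + 1L
  buf$data[buf$size] <- value
  buf
}

buf <- new_buffer()
for (v in 1:1000) buf <- push_back(buf, v)

values <- buf$data[seq_len(buf$size)]   # trim the unused tail
buf$grows                               # 8 enlargements for 1000 appends
```

Starting from 4 slots, the capacity passes through 8, 16, ..., 1024, so 1000 appends trigger only 8 enlargements.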

As tricks behind the scenes go, I've seen worse. Every time you push_back() a new row onto the dataframe, a top-level index structure would need to be copied. The new row could append onto the shared representation without impacting any old functional values. I don't even think it would complicate the garbage collector much; since I'm not proposing push_front(), all references are prefix references to the front of the allocated vector storage.


I've found this way to create a dataframe row by row without a matrix. Note that c() coerces each row to character, so every column ends up as character.

With automatic column names:

df<-data.frame(
        t(data.frame(c(1,"a",100),c(2,"b",200),c(3,"c",300)))
        ,row.names = NULL,stringsAsFactors = FALSE
    )

With explicit column names:

df<-setNames(
        data.frame(
            t(data.frame(c(1,"a",100),c(2,"b",200),c(3,"c",300)))
            ,row.names = NULL,stringsAsFactors = FALSE
        ), 
        c("col1","col2","col3")
    )
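Because each row is built with c(), mixing numbers and strings turns everything into character. A possible follow-up, assuming col1 and col3 are the columns meant to be numeric, is to convert them after the data.frame is built:

```r
df <- setNames(
  data.frame(
    t(data.frame(c(1, "a", 100), c(2, "b", 200), c(3, "c", 300))),
    row.names = NULL, stringsAsFactors = FALSE
  ),
  c("col1", "col2", "col3")
)

sapply(df, class)                 # every column is "character" at this point

df$col1 <- as.numeric(df$col1)    # convert the columns meant to be numeric
df$col3 <- as.numeric(df$col3)
str(df)
```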