
Efficiency of operations on R data structures

Tags: performance, r

I'm wondering if there's any documentation about the efficiency of operations in R, specifically those related to data manipulation.

For example:

  • I imagine it's efficient to add columns to a data frame, because I'm guessing you're just adding an element to a linked list.
  • I imagine adding rows is slower because vectors are held in arrays at the C level and you have to allocate a new array of length n+1 and copy all the elements over.

The developers probably don't want to tie themselves to a particular implementation, but it would be nice to have something more solid than guesses to go on.
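
Of course, one can at least measure these things directly. Here is a crude sketch of the kind of timing I could do myself (the sizes and loop counts are arbitrary), though I'm hoping for documentation rather than one-off measurements:

# time 100 column appends vs. 100 row appends on a 10,000-row data frame
df <- data.frame(x = rnorm(10000), y = rnorm(10000))
system.time(for (i in 1:100) df[[paste("col", i, sep = "")]] <- rnorm(10000))
system.time(for (i in 1:100) df <- rbind(df, df[1, ]))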

Also, I know the main R performance hint is to use vectorized operations whenever possible, as opposed to loops.

  • what about the various flavors of apply?
  • are those just hidden loops?
  • what about matrices vs. data frames?
asked Dec 28 '09 by cbare



1 Answer

Data I/O was one of the features I looked into before I committed to learning R. For better or worse, here are my observations and solutions/palliatives on these issues:

1. The claim that R doesn't handle big data (> 2 GB?). To me this is a misconception. By default, the common data input functions load your data into RAM. Not to be glib, but to me this is a feature, not a bug--any time my data will fit in my available RAM, that's where I want it. Likewise, one of SQLite's most popular features is the in-memory option--the user can easily load the entire DB into RAM. If your data won't fit in memory, then R makes it astonishingly easy to persist it: via connections to the common RDBMS systems (RODBC, RSQLite, RMySQL, etc.), via no-frills options like the filehash package, and via packages that exploit current technology/practices (for instance, I can recommend ff). In other words, the R developers have chosen a sensible (and probably optimal) default, from which it is very easy to opt out.
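
For instance, a minimal sketch of keeping a large table on disk with RSQLite and pulling back only the subset you need (the data frame, file name, table name, and query here are all placeholders):

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "mydata.db")   # on-disk database; use ":memory:" to keep it in RAM
dbWriteTable(con, "obs", my_big_data_frame)        # persist a data frame as a table
chunk <- dbGetQuery(con, "SELECT * FROM obs WHERE year = 2009")  # retrieve only what you need
dbDisconnect(con)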

2. The performance of read.table (read.csv, read.delim, et al.), the most common means of getting data into R, can be improved 5x (and often much more, in my experience) just by opting out of a few of read.table's default arguments--the ones having the greatest effect on performance are mentioned in R's help (?read.table). Briefly, the R developers tell us that if you provide values for the parameters 'colClasses', 'nrows', 'sep', and 'comment.char' (in particular, pass in '' for 'comment.char' if you know your file begins with headers or data on line 1), you'll see a significant performance gain. I've found that to be true.

Here are the snippets I use for those parameters:

To get the number of rows in your data file (supply this snippet as the value of the 'nrows' parameter in your call to read.table):

# count the lines in the file by shelling out to 'wc -l' (assumes a Unix-like system)
as.numeric(gsub("[^0-9]+", "", system(paste("wc -l ", file_name, sep=""), intern=TRUE)))

To get the classes for each column:

# read only the first few rows, then record each column's class
function(fname) sapply(read.table(fname, header=TRUE, nrows=5), class)

Note: You can't pass this snippet in as an argument; you have to call it first, then pass in the value returned--in other words, call the function, bind the returned value to a variable, and then pass in the variable as the value of the 'colClasses' parameter in your call to read.table.
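
Putting those pieces together, a full call might look something like this (the file name and separator are hypothetical):

tbl_classes <- sapply(read.table("mydata.csv", header=TRUE, nrows=5, sep=","), class)
tbl_nrows   <- as.numeric(gsub("[^0-9]+", "", system("wc -l mydata.csv", intern=TRUE)))
dat <- read.table("mydata.csv", header=TRUE, sep=",",
                  colClasses=tbl_classes, nrows=tbl_nrows, comment.char="")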

3. Using scan. With only a little more hassle, you can do better than that (optimizing 'read.table') by using 'scan' instead of 'read.table' ('read.table' is actually just a wrapper around 'scan'). Once again, this is very easy to do. I use 'scan' to input each column individually, then build my data.frame inside R, i.e., df = data.frame(cbind(col1, col2, ...)).
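
A sketch of that approach (the file name, separator, and column types are assumptions):

# comma-separated file with a header row and three columns: integer, double, character
cols <- scan("mydata.csv", what = list(id = 0L, x = 0, grp = ""), sep = ",", skip = 1)
df <- data.frame(cols)    # the named list becomes the columns of the data frame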

4. Use R's native format for persistence in place of ordinary file formats (e.g., 'txt', 'csv'). R's native data file format, '.RData', is a binary format that is a little smaller than a compressed ('.gz') txt data file. You create these files with save(object, file="myfile.RData") and load them back into the R namespace with load("myfile.RData"). The difference in load times compared with 'read.table' is dramatic. For instance, with a 25 MB file (uncompressed size):

system.time(read.table("tdata01.txt.gz", sep=","))
   user  system elapsed
  6.173   0.245   6.450

system.time(load("tdata01.RData"))
   user  system elapsed
  0.912   0.006   0.912
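
For reference, the round trip itself is just (the object name here is illustrative):

save(tdata01, file="tdata01.RData")   # write the object out in R's binary format
rm(tdata01)                           # remove it from the workspace...
load("tdata01.RData")                 # ...and restore it, under its original name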

5. Paying attention to data types can often give you a performance boost and reduce your memory footprint. This point is probably more useful in getting data out of R. The key point to keep in mind here is that, by default, numbers in R expressions are interpreted as double-precision floating point--e.g., typeof(5) returns "double". Compare the object size of a reasonably sized vector of each type and you can see the significance (use object.size()). So coerce to integer when you can.
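
For instance (exact sizes will vary slightly by platform and R version):

x <- rnorm(1e6)
object.size(x)                # roughly 8 MB: doubles take 8 bytes each
object.size(as.integer(x))    # roughly 4 MB: integers take 4 bytes each
typeof(5)                     # "double"
typeof(5L)                    # "integer" (the L suffix makes an integer literal)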

Finally, the 'apply' family of functions (among others) are not "hidden loops" or loop wrappers. They are loops implemented in C--big difference performance-wise. [edit: AWB has correctly pointed out that while 'sapply', 'tapply', and 'mapply' are implemented in C, 'apply' is simply a wrapper function.]
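
A quick way to check this kind of claim on your own machine (the vector length is arbitrary, and relative timings will vary by R version):

x <- rnorm(1e6)
system.time(sapply(x, function(v) v * 2))    # apply-family version
system.time({                                # explicit for loop
    out <- numeric(length(x))
    for (i in seq_along(x)) out[i] <- x[i] * 2
})
system.time(x * 2)                           # fully vectorized arithmetic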

answered Sep 24 '22 by doug