
Efficiency of operations on R data structures

Tags: performance, r

I'm wondering if there's any documentation about the efficiency of operations in R, specifically those related to data manipulation.

For example:

  • I imagine it's efficient to add columns to a data frame, because I'm guessing you're just adding an element to a linked list.
  • I imagine adding rows is slower because vectors are held in arrays at the C level and you have to allocate a new array of length n+1 and copy all the elements over.

The developers probably don't want to tie themselves to a particular implementation, but it would be nice to have something more solid than guesses to go on.
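
Of course, one can at least measure these things directly. Here is a crude sketch of the kind of timing I could do myself (the sizes and loop counts are arbitrary), though I'm hoping for documentation rather than one-off measurements:

# time 100 column appends vs. 100 row appends on a 10,000-row data frame
df <- data.frame(x = rnorm(10000), y = rnorm(10000))
system.time(for (i in 1:100) df[[paste("col", i, sep = "")]] <- rnorm(10000))
system.time(for (i in 1:100) df <- rbind(df, df[1, ]))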

Also, I know the main R performance hint is to use vectorized operations whenever possible, as opposed to loops.

  • what about the various flavors of apply?
  • are those just hidden loops?
  • what about matrices vs. data frames?
asked Dec 28 '09 by cbare



1 Answer

Data I/O was one of the features I looked into before I committed to learning R. For better or worse, here are my observations and solutions/palliatives on these issues:

1. The claim that R doesn't handle big data (> 2 GB?). To me this is a misconception. By default, the common data input functions load your data into RAM. Not to be glib, but to me this is a feature, not a bug--any time my data will fit in my available RAM, that's where I want it. Likewise, one of SQLite's most popular features is the in-memory option--the user can easily load the entire DB into RAM. If your data won't fit in memory, then R makes it astonishingly easy to persist it: via connections to the common RDBMS systems (RODBC, RSQLite, RMySQL, etc.), via no-frills options like the filehash package, and via packages that exploit current technology/practices (for instance, I can recommend ff). In other words, the R developers have chosen a sensible (and probably optimal) default, from which it is very easy to opt out.
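
For instance, a minimal sketch of keeping a large table on disk with RSQLite and pulling back only the subset you need (the data frame, file name, table name, and query here are all placeholders):

library(DBI)
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "mydata.db")   # on-disk database; use ":memory:" to keep it in RAM
dbWriteTable(con, "obs", my_big_data_frame)        # persist a data frame as a table
chunk <- dbGetQuery(con, "SELECT * FROM obs WHERE year = 2009")  # retrieve only what you need
dbDisconnect(con)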

2. The performance of read.table (read.csv, read.delim, et al.), the most common means of getting data into R, can be improved 5x (and often much more, in my experience) just by opting out of a few of read.table's default arguments--the ones having the greatest effect on performance are mentioned in R's help (?read.table). Briefly, the R developers tell us that if you provide values for the parameters 'colClasses', 'nrows', 'sep', and 'comment.char' (in particular, pass in '' for 'comment.char' if you know your file begins with headers or data on line 1), you'll see a significant performance gain. I've found that to be true.

Here are the snippets I use for those parameters:

To get the number of rows in your data file (supply this snippet as the value of the 'nrows' parameter in your call to read.table):

# count the lines in the file by shelling out to 'wc -l' (assumes a Unix-like system)
as.numeric(gsub("[^0-9]+", "", system(paste("wc -l ", file_name, sep=""), intern=TRUE)))

To get the classes for each column:

# read only the first few rows, then record each column's class
function(fname) sapply(read.table(fname, header=TRUE, nrows=5), class)

Note: You can't pass this snippet in as an argument; you have to call it first, then pass in the value returned--in other words, call the function, bind the returned value to a variable, and then pass in the variable as the value of the 'colClasses' parameter in your call to read.table.
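
Putting those pieces together, a full call might look something like this (the file name and separator are hypothetical):

tbl_classes <- sapply(read.table("mydata.csv", header=TRUE, nrows=5, sep=","), class)
tbl_nrows   <- as.numeric(gsub("[^0-9]+", "", system("wc -l mydata.csv", intern=TRUE)))
dat <- read.table("mydata.csv", header=TRUE, sep=",",
                  colClasses=tbl_classes, nrows=tbl_nrows, comment.char="")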

3. Using scan. With only a little more hassle, you can do better than that (optimizing 'read.table') by using 'scan' instead of 'read.table' ('read.table' is actually just a wrapper around 'scan'). Once again, this is very easy to do. I use 'scan' to input each column individually, then build my data.frame inside R, i.e., df = data.frame(cbind(col1, col2, ...)).
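
A sketch of that approach (the file name, separator, and column types are assumptions):

# comma-separated file with a header row and three columns: integer, double, character
cols <- scan("mydata.csv", what = list(id = 0L, x = 0, grp = ""), sep = ",", skip = 1)
df <- data.frame(cols)    # the named list becomes the columns of the data frame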

4. Use R's native format for persistence in place of ordinary file formats (e.g., 'txt', 'csv'). R's native data file format, '.RData', is a binary format that is a little smaller than a compressed ('.gz') txt data file. You create these files with save(object, file="myfile.RData") and load them back into the R namespace with load("myfile.RData"). The difference in load times compared with 'read.table' is dramatic. For instance, with a 25 MB file (uncompressed size):

system.time(read.table("tdata01.txt.gz", sep=","))
   user  system elapsed
  6.173   0.245   6.450

system.time(load("tdata01.RData"))
   user  system elapsed
  0.912   0.006   0.912
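
For reference, the round trip itself is just (the object name here is illustrative):

save(tdata01, file="tdata01.RData")   # write the object out in R's binary format
rm(tdata01)                           # remove it from the workspace...
load("tdata01.RData")                 # ...and restore it, under its original name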

5. Paying attention to data types can often give you a performance boost and reduce your memory footprint. This point is probably more useful in getting data out of R. The key point to keep in mind here is that, by default, numbers in R expressions are interpreted as double-precision floating point--e.g., typeof(5) returns "double". Compare the object size of a reasonably sized vector of each type and you can see the significance (use object.size()). So coerce to integer when you can.
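
For instance (exact sizes will vary slightly by platform and R version):

x <- rnorm(1e6)
object.size(x)                # roughly 8 MB: doubles take 8 bytes each
object.size(as.integer(x))    # roughly 4 MB: integers take 4 bytes each
typeof(5)                     # "double"
typeof(5L)                    # "integer" (the L suffix makes an integer literal)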

Finally, the 'apply' family of functions (among others) are not "hidden loops" or loop wrappers. They are loops implemented in C--big difference performance-wise. [edit: AWB has correctly pointed out that while 'sapply', 'tapply', and 'mapply' are implemented in C, 'apply' is simply a wrapper function.]
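
A quick way to check this kind of claim on your own machine (the vector length is arbitrary, and relative timings will vary by R version):

x <- rnorm(1e6)
system.time(sapply(x, function(v) v * 2))    # apply-family version
system.time({                                # explicit for loop
    out <- numeric(length(x))
    for (i in seq_along(x)) out[i] <- x[i] * 2
})
system.time(x * 2)                           # fully vectorized arithmetic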

answered Sep 24 '22 by doug