
R: selecting subset without copying

Is there a way to select a subset from objects (data frames, matrices, vectors) without making a copy of selected data?

I work with quite large data sets but never modify them. For convenience, however, I often select subsets of the data to operate on. Copying a large subset each time wastes memory, yet both normal indexing and subset() (and thus the xapply() family of functions) copy the selected data. So I'm looking for functions or data structures that avoid this.

Some possible approaches that may fit my needs and hopefully are implemented in some R packages:

  • copy-on-write mechanism, i.e. data structures that are copied only when you add or rewrite existing elements;
  • immutable data structures that only require recreating the indexing information for the data structure, not its content (like making a substring from a string by creating only a small object that holds a length and a pointer to the same char array);
  • xapply() analogues that do not create subsets.
ffriend asked Mar 05 '12 19:03



1 Answer

Try the ref package, specifically its refdata class.

What you might be missing about data.table is that when grouping (the by= parameter), the subsets of data are not copied, so it's fast. [Well, technically they are, but into a shared area of memory that is reused for each group, and they are copied using memcpy, which is much faster than R's for loops in C.]
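A minimal sketch of grouping with by=, assuming the data.table package is installed. The table DT and its columns group and value are made up for illustration; the point is only that the per-group subsets are handled inside data.table rather than by building a separate subset object per group in R code.

```r
library(data.table)

# Small stand-in for a large table (hypothetical data).
DT <- data.table(group = c("a", "a", "b", "b", "b"),
                 value = c(1, 2, 3, 4, 5))

# Aggregate per group; no explicit DT[group == "a"] style subsets are created.
means <- DT[, .(mean_value = mean(value)), by = group]
means
```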

:= in data.table is one way to modify a data.table in place. data.table departs from the usual R programming style in that it is not copy-on-write: the user has to call copy() explicitly to copy a (potentially very large) table, even within a function.
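A minimal sketch of the difference between := and copy(), assuming the data.table package is installed. The names DT, alias and snapshot are hypothetical and exist only for this illustration.

```r
library(data.table)

DT <- data.table(id = 1:5, value = c(10, 20, 30, 40, 50))

# Plain assignment binds a second name to the same table; nothing is copied.
alias <- DT

# copy() is the explicit way to get an independent table.
snapshot <- copy(DT)

# := updates the table by reference, so the change is visible through alias
# but not through the explicit copy.
DT[id == 2L, value := 200]

alias$value[2]     # reflects the in-place update
snapshot$value[2]  # keeps the old value
```

This is also why a function that must not alter its argument has to call copy() on it before using :=.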

You're right that there isn't a mechanism like refdata built into data.table. I see what you mean and it would be a nice feature. refdata should work on a data.table, though, and you might be fine with data.frame (but be sure to monitor copies with tracemem(DF)).
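A minimal sketch of monitoring copies of a data.frame with tracemem(), using base R only. It assumes an R build with memory profiling enabled (checked with capabilities("profmem")); DF and DF2 are hypothetical names.

```r
# Small stand-in for a large data.frame (hypothetical data).
DF <- data.frame(x = 1:5, y = letters[1:5])

# tracemem() needs an R build with memory profiling enabled.
if (capabilities("profmem")) tracemem(DF)

# Binding a second name does not copy anything yet.
DF2 <- DF

# Modifying through the second name forces a copy; with tracing on,
# R prints a "tracemem[...]" line when that happens.
DF2$x[1] <- 100L

if (capabilities("profmem")) untracemem(DF)

DF$x[1]  # the original is untouched
```

If no tracemem line appears during an operation on a traced object, that operation did not duplicate it.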

There is also idata.frame (immutable data.frame) in package plyr you could try.
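A minimal sketch of idata.frame, assuming the plyr package is installed. The data frame big and its columns g and v are made up for illustration; the pattern follows the dlply() usage shown in the plyr documentation for idata.frame.

```r
library(plyr)

# Small stand-in for a large data.frame (hypothetical data).
big <- data.frame(g = rep(c("a", "b"), each = 3), v = 1:6)

# Wrap it as an immutable data frame; subsets of it refer back to the
# original data and a row index instead of holding their own copy.
ibig <- idata.frame(big)

# Split-apply over groups without materialising each piece as a data.frame.
counts <- dlply(ibig, "g", nrow)
counts
```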

Matt Dowle answered Sep 23 '22 15:09