Is there a way to select a subset from objects (data frames, matrices, vectors) without making a copy of selected data?
I work with quite large data sets but never change them. However, for convenience I often select subsets of the data to operate on. Making a copy of a large subset each time is very memory-inefficient, but both normal indexing and subset() (and thus the xapply() family of functions) create copies of the selected data. So I'm looking for functions or data structures that can overcome this issue.
Some possible approaches that may fit my needs and hopefully are implemented in some R packages:
xapply() analogues that do not create subsets.
To specify a logical expression for the rows parameter, use the standard R operators. If subsetting is done by only rows or only columns, then leave the other value blank. For example, to subset the d data frame only by rows, the general form reduces to d[rows,]. Similarly, to subset only by columns, use d[,cols].
To select a specific column, you can also type the name of the data frame, followed by a $, and then the name of the column you want to select. In this example, we select the payment column of the data frame. R simplifies the result to a vector.
The way you tell R that you want to select some particular elements (i.e., a 'subset') from a vector is by placing an 'index vector' in square brackets immediately following the name of the vector. For a simple example, try x[1:10] to view the first ten elements of x.
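The indexing forms described above can be sketched in base R as follows. The data frame d and its columns are hypothetical (only the payment column name comes from the text), and each of these operations allocates a new object for the selected data, which is exactly the copying the question is concerned with.

```r
# Hypothetical data; only the 'payment' column name is taken from the text.
d <- data.frame(id = 1:6, payment = c(10, 25, 5, 40, 15, 30))

# Rows only: logical expression in the rows position, columns left blank.
by_rows <- d[d$payment > 12, ]

# Columns only: rows left blank. drop = FALSE keeps the result a data frame;
# without it, selecting a single column is simplified to a vector.
by_cols <- d[, "payment", drop = FALSE]

# $ selects one column and returns it as a vector.
pay <- d$payment

# Index vector in square brackets: the first ten elements of x.
x <- seq(2, 40, by = 2)
first_ten <- x[1:10]
```

Note that d[, "payment"] without drop = FALSE would behave like d$payment and return a plain vector.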
Try package ref, specifically its refdata class.
What you might be missing about data.table is that when grouping (the by= parameter) the subsets of data are not copied, so that's fast. [Well, technically they are, but into a shared area of memory which is reused for each group, and they are copied using memcpy, which is much faster than R's for loops in C.]
:= in data.table is one way to modify a data.table in place. data.table departs from the usual R programming style in that it is not copy-on-write. The user has to call copy() explicitly to copy a (potentially very large) table, even within a function.
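A minimal sketch of the behaviour described above, assuming the data.table package is installed. The table DT and its column names are made up for illustration; address() is data.table's helper for inspecting where an object lives in memory.

```r
library(data.table)

# Hypothetical example data; column names are illustrative only.
DT <- data.table(g = c("a", "a", "b", "b", "b"), v = 1:5)

# Grouped aggregation with by=: no per-group data.table is built at R level.
sums <- DT[, .(total = sum(v)), by = g]

# := adds a column by reference, so DT stays at the same address.
addr_before <- address(DT)
DT[, v2 := v * 2L]
addr_after <- address(DT)

# Plain assignment only gives the same table a second name;
# copy() is required to get an independent object.
alias <- DT
indep <- copy(DT)
DT[, v3 := v + 1L]

names(alias)  # includes "v3": alias and DT are the same table
names(indep)  # does not include "v3": indep is a separate copy
```

The alias case is the one to watch inside functions: modifying an argument with := changes the caller's table unless copy() is called first.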
You're right that there isn't a mechanism like refdata built into data.table. I see what you mean and it would be a nice feature. refdata should work on a data.table, though, and you might be fine with data.frame (but be sure to monitor copies with tracemem(DF)).
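One way to monitor copies with tracemem(), sketched on a small hypothetical data frame. tracemem() reports when the traced object is duplicated (for example by copy-on-modify); it does not report the new memory allocated by an ordinary subset such as DF[1:3, ]. It is only available when R was built with memory profiling, hence the capabilities("profmem") guard.

```r
# Hypothetical data frame; names are illustrative only.
DF <- data.frame(id = 1:5, value = c(2.5, 3.1, 4.7, 1.2, 9.9))

# tracemem() needs an R build with memory profiling enabled.
traceable <- capabilities("profmem")
if (traceable) tracemem(DF)

# A second name for the same object: nothing is copied yet.
DF2 <- DF

# Modifying through DF2 forces a duplication, which tracemem reports
# on the console; DF itself is left unchanged.
DF2$value[1] <- 0

if (traceable) untracemem(DF)
```

If a tracemem message shows up during code that was only supposed to read the data, that is a sign that a copy is being made somewhere unexpected.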
There is also idata.frame (immutable data.frame) in package plyr that you could try.