I've got a dataframe dat of size 30000 x 50. I also have a separate list that contains pointers to groupings of rows from this dataframe, e.g.,
rows <- list(c("34", "36", "39"), c("45", "46"))
This says that dataframe rows with rownames (not numeric row indices, but character rownames(dat)) "34", "36", and "39" constitute one grouping, and "45" and "46" constitute another grouping.
Now I want to pull out the groupings from the dataframe into a parallel list, but my code (below) is really, really slow. How can I speed it up?
> system.time(lapply(rows, function(r) {dat[r, ]}))
user system elapsed
246.09 0.01 247.23
That's on a very fast computer, R 2.14.1 x64.
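For anyone who wants to reproduce this: a simulated stand-in with roughly the same shape as my data (not the actual dataset, and the group count is purely illustrative) could be built like this:
set.seed(1)
dat <- as.data.frame(matrix(rnorm(30000 * 50), nrow = 30000))
rownames(dat) <- as.character(seq_len(nrow(dat)))
# say, ~10000 groupings of rows, referenced by character rownames
rows <- split(rownames(dat), sample(10000, nrow(dat), replace = TRUE))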
One of the main issues is the matching of row names -- the default in [.data.frame is partial matching of row names and you probably don't want that, so you're better off with match. To speed it up even further you can use fmatch from the fastmatch package if you want. This is a minor modification with some speedup:
# naive
> system.time(res1 <- lapply(rows,function(r) dat[r,]))
user system elapsed
69.207 5.545 74.787
# match
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[match(r,rn),]))
user system elapsed
36.810 10.003 47.082
# fastmatch
> library(fastmatch)
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[fmatch(r,rn),]))
user system elapsed
19.145 3.012 22.226
You can get further speedup by not using [ (it is slow for data frames) but splitting the data frame (using split) if your rows are non-overlapping and cover all rows (and thus you can map each row to one entry in rows).
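For example, a minimal sketch of that split-based idea, assuming the groups in rows are non-overlapping and cover every row of dat (the group vector g is built here only for illustration):
# build a group id for every row of dat from the rows list
g <- integer(nrow(dat))
g[match(unlist(rows), rownames(dat))] <- rep(seq_along(rows), sapply(rows, length))
# one data frame per group; row order within a group follows dat, not rows
res2 <- split(dat, g)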
Depending on your actual data you may be better off with matrices, which have far faster subsetting operators since they are native.
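For instance, if all 50 columns happen to be numeric (an assumption -- this does not work for mixed-type columns), something like:
m  <- as.matrix(dat)    # rownames carry over from the data frame
rn <- rownames(m)
res3 <- lapply(rows, function(r) m[match(r, rn), , drop = FALSE])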
My original post started with this erroneous statement:
The problem with indexing via rownames and colnames is that you are running a vector/linear scan for each element, e.g. you are hunting through each row to see which is named "36", then starting from the beginning to do it again for "34".
Simon pointed out in the comments here that R apparently uses a hash table for indexing. Sorry for the mistake.
Note that the suggestions in this answer assume that you have non-overlapping subsets of data.
If you want to keep your list-lookup strategy, I'd suggest storing the actual row indices instead of string names.
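For example, a sketch of that idea, where the one-time match is the only lookup by name:
# translate the character rownames to integer indices once, up front
rows.idx <- lapply(rows, match, table = rownames(dat))
# subsequent lookups use plain integer subsetting
res <- lapply(rows.idx, function(i) dat[i, ])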
An alternative is to store your "group" information as another column in your data.frame, then split your data.frame on its group, e.g. let's say your recoded data.frame looks like this:
dat <- data.frame(a=sample(100, 10),
b=rnorm(10),
group=sample(c('a', 'b', 'c'), 10, replace=TRUE))
You could then do:
split(dat, dat$group)
$a
a b group
2 66 -0.08721261 a
9 62 -1.34114792 a
$b
a b group
1 32 0.9719442 b
5 79 -1.0204179 b
6 83 -1.7645829 b
7 73 0.4261097 b
10 44 -0.1160913 b
$c
a b group
3 77 0.2313654 c
4 74 -0.8637770 c
8 29 1.0046095 c
Or, depending on what you really want to do with your "splits", you can convert your data.frame to a data.table and set its key to your new group column:
library(data.table)
dat <- data.table(dat, key="group")
Now do your list thing -- which will give you the same result as the split above:
x <- lapply(unique(dat$group), function(g) dat[J(g),])
But you probably want to "work over your splits", and you can do that inline, e.g.:
ans <- dat[, {
## do some code over the data in each split
## and return a list of results, eg:
list(nrow=length(a), mean.a=mean(a), mean.b=mean(b))
}, by="group"]
ans
group nrow mean.a mean.b
[1,] a 2 64.0 -0.7141803
[2,] b 5 62.2 -0.3006076
[3,] c 3 60.0 0.1240660
You can do the last step in "a similar fashion" with plyr, e.g.:
library(plyr)
ddply(dat, "group", summarize, nrow=length(a), mean.a=mean(a),
mean.b=mean(b))
group nrow mean.a mean.b
1 a 2 64.0 -0.7141803
2 b 5 62.2 -0.3006076
3 c 3 60.0 0.1240660
But since you mention your dataset is quite large, I think you'd like the speed boost that data.table will provide.