fast subsetting in R

Tags:

dataframe

r

I've got a dataframe dat of size 30000 x 50. I also have a separate list that contains pointers to groupings of rows from this dataframe, e.g.,

rows <- list(c("34", "36", "39"), c("45", "46"))

This says that dataframe rows with rownames (not numeric row indices, but character rownames(dat)) "34", "36", "39" constitute one grouping, and "45", "46" constitute another grouping.

Now I want to pull out the groupings from the dataframe into a parallel list, but my code (below) is really, really slow. How can I speed it up?

> system.time(lapply(rows, function(r) {dat[r, ]}))
   user  system elapsed 
 246.09    0.01  247.23 

That's on a very fast computer, R 2.14.1 x64.
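In case it helps to reproduce the timing, here is a minimal sketch of synthetic data with the stated dimensions; the real dat and rows are not shown here, so this construction is an assumption:

# Assumed, synthetic stand-in for the real data: 30000 x 50 numeric columns,
# character row names, and ~5000 groupings of 6 row names each.
set.seed(1)
dat <- as.data.frame(matrix(rnorm(30000 * 50), nrow = 30000))
rownames(dat) <- as.character(seq_len(30000))
rows <- split(sample(rownames(dat)), rep(1:5000, length.out = 30000))
system.time(lapply(rows, function(r) dat[r, ]))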

Asked Jan 20 '12 by Jack Tanner

2 Answers

One of the main issues is the matching of row names -- the default in [.data.frame is partial matching of row names, and you probably don't want that, so you're better off with match. To speed it up even further you can use fmatch from the fastmatch package. This is a minor modification with some speedup:

# naive
> system.time(res1 <- lapply(rows,function(r) dat[r,]))
   user  system elapsed 
 69.207   5.545  74.787 

# match
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[match(r,rn),]))
   user  system elapsed 
 36.810  10.003  47.082 

# fastmatch
> library(fastmatch)
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[fmatch(r,rn),]))
   user  system elapsed 
 19.145   3.012  22.226 

You can get a further speedup by not using [ at all (it is slow for data frames) and instead splitting the data frame with split, provided your groups are non-overlapping and cover all rows, so that each row maps to exactly one entry in rows.
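For example, a sketch of that split idea (assuming every row of dat appears in exactly one element of rows) could look like this:

# Sketch only: build a grouping vector that maps each row of dat to its entry
# in rows, then let split() do one pass instead of repeated [ calls.
grp <- integer(nrow(dat))
grp[match(unlist(rows), rownames(dat))] <- rep(seq_along(rows), sapply(rows, length))
res2 <- split(dat, factor(grp, levels = seq_along(rows)))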

Depending on your actual data, you may be better off with matrices, which have by far faster subsetting operators since they are native.
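If all 50 columns share one type, a rough sketch of that would be (an assumption -- a matrix cannot hold mixed column types):

# Sketch only: assumes every column of dat is numeric (or otherwise one type).
m <- as.matrix(dat)
rn <- rownames(m)
system.time(res3 <- lapply(rows, function(r) m[match(r, rn), , drop = FALSE]))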

Answered by Simon Urbanek


Update

My original post started with this erroneous statement:

The problem with indexing via rownames and colnames is that you are running a vector/linear scan for each element, e.g. you are hunting through each row to see which is named "36", then starting from the beginning to do it again for "34".

Simon pointed out in the comments here that R apparently uses a hash table for indexing. Sorry for the mistake.

Original Answer

Note that the suggestions in this answer assume that you have non-overlapping subsets of data.

If you want to keep your list-lookup strategy, I'd suggest storing the actual integer row indices instead of the string names.
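For instance, a sketch of that (resolving the names to integer positions once, then reusing them) might be:

# Sketch only: translate character row names to integer positions one time,
# then subset with integers to avoid repeated name lookups.
idx <- lapply(rows, match, table = rownames(dat))
res <- lapply(idx, function(i) dat[i, ])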

An alternative is to store your "group" information as another column of your data.frame, then split your data.frame on its group. E.g., let's say your recoded data.frame looks like this:

dat <- data.frame(a=sample(100, 10),
                  b=rnorm(10),
                  group=sample(c('a', 'b', 'c'), 10, replace=TRUE))

You could then do:

split(dat, dat$group)
$a
   a           b group
2 66 -0.08721261     a
9 62 -1.34114792     a

$b
    a          b group
1  32  0.9719442     b
5  79 -1.0204179     b
6  83 -1.7645829     b
7  73  0.4261097     b
10 44 -0.1160913     b

$c
   a          b group
3 77  0.2313654     c
4 74 -0.8637770     c
8 29  1.0046095     c

Or, depending on what you really want to do with your "splits", you can convert your data.frame to a data.table and set its key to your new group column:

library(data.table)
dat <- data.table(dat, key="group")

Now do your list thing, which will give you the same result as the split above:

 x <- lapply(unique(dat$group), function(g) dat[J(g),])

But you probably want to "work over your splits", and you can do that inline, e.g.:

ans <- dat[, {
  ## do some code over the data in each split
  ## and return a list of results, eg:
  list(nrow=length(a), mean.a=mean(a), mean.b=mean(b))
}, by="group"]

ans
     group nrow mean.a     mean.b
[1,]     a    2   64.0 -0.7141803
[2,]     b    5   62.2 -0.3006076
[3,]     c    3   60.0  0.1240660

You can do the last step in "a similar fashion" with plyr, e.g.:

library(plyr)
ddply(dat, "group", summarize, nrow=length(a), mean.a=mean(a),
      mean.b=mean(b))
  group nrow mean.a     mean.b
1     a    2   64.0 -0.7141803
2     b    5   62.2 -0.3006076
3     c    3   60.0  0.1240660

But since you mention your dataset is quite large, I think you'd like the speed boost data.table will provide.

Answered by Steve Lianoglou