Fastest way to do subsetting in R

Question

I have a data frame called main that has 400,000 rows and I want to subset it to retrieve 1 or more rows.

As an example here is a data frame which shows the kind of subsetting I am using the subset function:

main <- data.frame(date = as.POSIXct(c("2015-01-01 07:44:00 GMT","2015-02-02 09:46:00 GMT")),
                   name= c("bob","george"),
                   value=c(1,522), 
                   id= c(5,2))

subset(main, date == "2015-01-01 07:44:00" & name == "bob" & value == 1)

This works but it is slow and I think it is because I am working with a 400k row data frame. Any ideas how to make subsetting faster?

Frank · Accepted Answer

I'd suggest using a keyed data.table. Here is how to set that up (for a modified example):

require(data.table)
mainDT <- data.table(main)
setkey(mainDT,V1,V2,V3)

We can now subset based on equality conditions using syntax like

mainDT[J("a","A")]

or

mainDT[J(c("a","b"),"A",1)]

which subsets to where V1 %in% c("a","b") (equivalent to V1=="a"|V1=="b").

Here is a speed comparison:

require(rbenchmark)
benchmark(
  "["       = main[main$V1=="a" & main$V2=="A",],
  "subset"  = subset(main,V1=="a" & V2=="A"),
  "DT[J()]" = mainDT[J("a","A")],
  replications=5
)[,1:6]

which gives these results on my computer:

     test replications elapsed relative user.self sys.self
1       [            5    5.96       NA      5.38     0.57
3 DT[J()]            5    0.00       NA      0.00     0.00
2  subset            5    6.93       NA      6.20     0.72

So, subsetting with J is instant, while the other two methods take several seconds. Subsetting with J in this way is limited, however:

It is for equality conditions only.
For the simple syntax above, you need to pass arguments in the order of the key. However, you can select where V1=="a" & V3 == 2 using mainDT[J("a",unique(V2),2)] and it's still quite fast.

Everything you can do with a data.frame can also be done with a data.table. For example, subset(mainDT,V1=="a" & V2=="A") still works. So there is nothing lost by switching your data.frames to data.tables, generally. You can convert to a data.table in place with setDT(main).

Here is the code for the example:

n  = 1e7
n3 = 1e3

set.seed(1)
main <- data.frame(
  V1=sample(letters,n,replace=TRUE),
  V2=sample(c(letters,LETTERS),n,replace=TRUE),
  V3=sample(1:n3,n,replace=TRUE),
  V4=rnorm(n))

The improvement seen in the benchmark above will vary with your data. When you have many observations (n) or few unique values for the keys (e.g., n3), the benefit of subsetting with a keyed data.table should be greater.

Fastest way to do subsetting in R

Tags:

r

user3022875

1 Answers

Frank

Recent Activity

Donate For Us

Fastest way to do subsetting in R

Tags:

r

user3022875

1 Answers

Frank

Related questions

Recent Activity

Donate For Us