I have a data frame called main that has 400,000 rows and I want to subset it to retrieve 1 or more rows.
As an example here is a data frame which shows the kind of subsetting I am using the subset function:
main <- data.frame(date = as.POSIXct(c("2015-01-01 07:44:00 GMT","2015-02-02 09:46:00 GMT")),
name= c("bob","george"),
value=c(1,522),
id= c(5,2))
subset(main, date == "2015-01-01 07:44:00" & name == "bob" & value == 1)
This works but it is slow and I think it is because I am working with a 400k row data frame. Any ideas how to make subsetting faster?
I'd suggest using a keyed data.table. Here is how to set that up (for a modified example):
require(data.table)
mainDT <- data.table(main)
setkey(mainDT,V1,V2,V3)
We can now subset based on equality conditions using syntax like
mainDT[J("a","A")]
or
mainDT[J(c("a","b"),"A",1)]
which subsets to where V1 %in% c("a","b") (equivalent to V1=="a"|V1=="b").
Here is a speed comparison:
require(rbenchmark)
benchmark(
"[" = main[main$V1=="a" & main$V2=="A",],
"subset" = subset(main,V1=="a" & V2=="A"),
"DT[J()]" = mainDT[J("a","A")],
replications=5
)[,1:6]
which gives these results on my computer:
test replications elapsed relative user.self sys.self
1 [ 5 5.96 NA 5.38 0.57
3 DT[J()] 5 0.00 NA 0.00 0.00
2 subset 5 6.93 NA 6.20 0.72
So, subsetting with J is instant, while the other two methods take several seconds. Subsetting with J in this way is limited, however:
V1=="a" & V3 == 2 using mainDT[J("a",unique(V2),2)] and it's still quite fast.Everything you can do with a data.frame can also be done with a data.table. For example, subset(mainDT,V1=="a" & V2=="A") still works. So there is nothing lost by switching your data.frames to data.tables, generally. You can convert to a data.table in place with setDT(main).
Here is the code for the example:
n = 1e7
n3 = 1e3
set.seed(1)
main <- data.frame(
V1=sample(letters,n,replace=TRUE),
V2=sample(c(letters,LETTERS),n,replace=TRUE),
V3=sample(1:n3,n,replace=TRUE),
V4=rnorm(n))
The improvement seen in the benchmark above will vary with your data. When you have many observations (n) or few unique values for the keys (e.g., n3), the benefit of subsetting with a keyed data.table should be greater.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With