I am doing the classic split-apply-combine routine in R. My data set is a bunch of firms over time. The "apply" step runs a regression for each firm and returns the residuals, so I am not aggregating by firm. plyr is great for this, but it takes a very long time to run when the number of firms is large. Is there a way to do this with data.table?
Sample Data:
dte, id, val1, val2
2001-10-02, 1, 10, 25
2001-10-03, 1, 11, 24
2001-10-04, 1, 12, 23
2001-10-02, 2, 13, 22
2001-10-03, 2, 14, 21
I need to split by each id (namely 1 and 2), run a regression, return the residuals, and append them as a column to my data. Is there a way to do this using data.table?
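For context, the plyr version is something like this (a sketch, assuming a data.frame dat holding the rows above):
library(plyr)
# Per-firm regression: fit lm within each id and append the residuals
# alongside the original columns (no aggregation).
res <- ddply(dat, .(id), transform, resid = residuals(lm(val1 ~ val2)))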
DWin's answer is correct for v1.8.0 (as currently on CRAN). But in v1.8.1 (on the R-Forge repository), := now works by group. It works for non-contiguous groups too, so there is no need to setkey first for it to line up.
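(For a self-contained run, dat can be built from the sample data in the question; a minimal sketch:)
dat <- data.frame(
  dte  = as.Date(c("2001-10-02", "2001-10-03", "2001-10-04", "2001-10-02", "2001-10-03")),
  id   = c(1, 1, 1, 2, 2),
  val1 = c(10, 11, 12, 13, 14),
  val2 = c(25, 24, 23, 22, 21))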
dtb <- as.data.table(dat)
dtb
dte id val1 val2
1: 2001-10-02 1 10 25
2: 2001-10-03 1 11 24
3: 2001-10-04 1 12 23
4: 2001-10-02 2 13 22
5: 2001-10-03 2 14 21
dtb[, resid:=residuals(lm(val1 ~ val2)), by=id]
dte id val1 val2 resid
1: 2001-10-02 1 10 25 1.631688e-15
2: 2001-10-03 1 11 24 -3.263376e-15
3: 2001-10-04 1 12 23 1.631688e-15
4: 2001-10-02 2 13 22 0.000000e+00
5: 2001-10-03 2 14 21 0.000000e+00
To upgrade to v1.8.1, just install from the R-Forge repo (R 2.15.0+ is needed when installing any binary package from R-Forge):
install.packages("data.table", repos="http://R-Forge.R-project.org")
or install from source if you can't upgrade to the latest R; data.table itself only needs R 2.12.0+.
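The source install would look along these lines (assuming you have the toolchain needed to compile R packages from source):
install.packages("data.table", repos="http://R-Forge.R-project.org", type="source")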
Extending to the 1MM case:
DT = data.table(dte=Sys.Date()+1:1000000,
id=sample(1:2, 1000000, repl=TRUE),
val1=runif(1000000), val2=runif(1000000) )
setkey(DT, id)
system.time(ans1 <- cbind(DT, DT[, residuals(lm(val1 ~ val2)), by="id"]) )
user system elapsed
12.272 0.872 13.182
ans1
dte id val1 val2 id V1
1: 2012-07-02 1 0.8369147 0.57553383 1 0.336647598
2: 2012-07-05 1 0.0109102 0.02532214 1 -0.488633325
3: 2012-07-06 1 0.4977762 0.16607786 1 -0.001952414
---
999998: 4750-05-27 2 0.1296722 0.62645838 2 -0.370627034
999999: 4750-05-28 2 0.2686352 0.04890710 2 -0.231952238
1000000: 4750-05-29 2 0.9981029 0.91626787 2 0.497948275
system.time(DT[, resid:=residuals(lm(val1 ~ val2)), by=id])
user system elapsed
7.436 0.648 8.107
DT
dte id val1 val2 resid
1: 2012-07-02 1 0.8369147 0.57553383 0.336647598
2: 2012-07-05 1 0.0109102 0.02532214 -0.488633325
3: 2012-07-06 1 0.4977762 0.16607786 -0.001952414
---
999998: 4750-05-27 2 0.1296722 0.62645838 -0.370627034
999999: 4750-05-28 2 0.2686352 0.04890710 -0.231952238
1000000: 4750-05-29 2 0.9981029 0.91626787 0.497948275
The example above only has 2 groups and is quite small at under 40MB, and Rprof shows 96% of the time is spent in lm. So in these cases := by group is not really there for a speed advantage, but more for the convenience: less code to write and no superfluous columns added to the output. As size grows, the avoidance of copies comes into play and the speed advantages start to show. In particular, transform in j will slow down terribly as the number of groups increases; a sketch of that approach is below for comparison.
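The transform-in-j alternative would be something like this (a sketch; it returns the same residuals but copies each group's subset, which is what hurts as the number of groups grows):
# Slower alternative: transform() on .SD inside j copies each group's data
# and allocates a whole new result, rather than updating DT by reference.
ans2 <- DT[, transform(.SD, resid = residuals(lm(val1 ~ val2))), by = id]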