Say I have a data.table in which one column contains linear models:
library(data.table)
set.seed(1014)
dt <- data.table(
g = c(1, 1, 2, 2, 3, 3, 3),
x = runif(7),
y = runif(7)
)
models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]
Now I want to extract the r-squared value from each model. Can I do better than this?
models[, list(rsq = summary(mod[[1]])$r.squared), by = g]
## g rsq
## 1: 1 1.000000
## 2: 2 1.000000
## 3: 3 0.004452
Ideally, I'd like to be able to eliminate the [[1]] and not rely on knowing the previous grouping variable (I know I want each row to be its own group).
This is just summary being a bad little function that isn't vectorized. So how about vectorizing it manually (this is roughly the same as @mnel's solution):
r.squared = Vectorize(function(x) summary(x)$r.squared)
models[, rsq := r.squared(mod)]
models
# g mod rsq
#1: 1 <lm> 1.000000000
#2: 2 <lm> 1.000000000
#3: 3 <lm> 0.004451631
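For anyone wondering what Vectorize is doing here: it wraps the function in a call to mapply, so the wrapper is applied to each element of the list in turn. A minimal sketch of that equivalence, using the built-in mtcars data instead of the data.table above (the names fits, via_vectorize and via_mapply are only illustrative):

```r
# Fit one model per cylinder group and keep them in a plain named list.
fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))

# Vectorized wrapper, as in the answer above.
r.squared <- Vectorize(function(x) summary(x)$r.squared)
via_vectorize <- r.squared(fits)

# The same computation written directly with mapply.
via_mapply <- mapply(function(x) summary(x)$r.squared, fits)

# Both give a named numeric vector with one R-squared per model.
print(all.equal(via_vectorize, via_mapply))
```

So the Vectorize version is mostly a convenience: the per-element loop is still there, it is just hidden inside the wrapper.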
My first thought was to use rapply, with classes='lm', but that does not work. sapply, however, does (to my surprise):
library(data.table)
set.seed(1014)
dt <- data.table(
g = c(1, 1, 2, 2, 3, 3, 3),
x = runif(7),
y = runif(7)
)
models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]
models[, rsq := sapply(mod, function(x) summary(x)$r.squared)]
models
# g mod rsq
# 1: 1 <lm> 1.000000000
# 2: 2 <lm> 1.000000000
# 3: 3 <lm> 0.004451631
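If you want the result type to be checked, vapply can stand in for sapply. This is only a sketch of the same idea: get_rsq is a hypothetical helper name, and numeric(1) declares that each model must yield exactly one number, so anything else raises an error instead of silently producing a list.

```r
library(data.table)
set.seed(1014)
dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)
models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]

# Hypothetical helper: pull R-squared out of a single fitted model.
# Two-point groups fit perfectly, so the expected "perfect fit" warning is muffled.
get_rsq <- function(m) suppressWarnings(summary(m)$r.squared)

# vapply enforces that every element returns a length-one numeric.
models[, rsq := vapply(mod, get_rsq, numeric(1))]
models
```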
"Doing other things" to the model within data.table
might be problematic because of the way .SD works as an environment.
See "Why is using update on a lm inside a grouped data.table losing its model data?" for an example of what can occur. This is the subject of bug #2590.