
Working with rich objects in data.table columns

Tags: r, data.table

Say I have a data.table in which one column contains linear models:

library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)

models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]

Now I want to extract the r-squared value from each model. Can I do better than this?

models[, list(rsq = summary(mod[[1]])$r.squared), by = g]

##    g      rsq
## 1: 1 1.000000
## 2: 2 1.000000
## 3: 3 0.004452

Ideally, I'd like to be able to eliminate the [[1]] and not rely on knowing the previous grouping variable (I know I want each row to be its own group).

hadley asked Feb 13 '23 06:02


2 Answers

This is just summary being a bad little function that's not vectorized. So how about vectorizing it manually (this is roughly the same as @mnel's solution):

r.squared = Vectorize(function(x) summary(x)$r.squared)

models[, rsq := r.squared(mod)]
models
#   g  mod         rsq
#1: 1 <lm> 1.000000000
#2: 2 <lm> 1.000000000
#3: 3 <lm> 0.004451631
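
A type-checked variant of the same idea, as a minimal sketch: base R's vapply does the same per-element work but errors if any model fails to return exactly one numeric value. The block rebuilds the example data from the question so that it runs on its own; the column name rsq simply follows the answers here.

```r
library(data.table)
set.seed(1014)

# Same example data and per-group models as in the question
dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)
models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]

# vapply walks the list column one model at a time and checks that
# each call returns a single numeric value (numeric(1))
models[, rsq := vapply(mod, function(m) summary(m)$r.squared, numeric(1))]
```

Groups 1 and 2 contain only two points each, so summary() may warn about an essentially perfect fit; the warning does not stop the assignment.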
eddi answered Feb 15 '23 20:02


My first thought was to use rapply with classes = 'lm', but that does not work. sapply, however, does (to my surprise):

library(data.table)
set.seed(1014)

dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)

models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]
models[, rsq := sapply(mod, function(x) summary(x)$r.squared)]

models
#     g  mod         rsq
#  1: 1 <lm> 1.000000000
#  2: 2 <lm> 1.000000000
#  3: 3 <lm> 0.004451631
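
If the grouped form from the question is preferred, one way to avoid depending on g is to group by a row index instead. This is only a sketch under assumptions: the column name row and the result name rsq_by_row are hypothetical choices, and mod[[1]] is still needed because each group sees mod as a length-one list.

```r
library(data.table)
set.seed(1014)

# Same example data and per-group models as in the question
dt <- data.table(
  g = c(1, 1, 2, 2, 3, 3, 3),
  x = runif(7),
  y = runif(7)
)
models <- dt[, list(mod = list(lm(y ~ x, data = .SD))), by = g]

# Group by row position so that every row is its own group,
# without referring to the original grouping column g
rsq_by_row <- models[
  , list(rsq = summary(mod[[1]])$r.squared),
  by = list(row = seq_len(nrow(models)))
]
```

The sapply form above stays shorter; the row-index grouping is mainly useful when the per-row expression needs other columns from the same row as well.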

"Doing other things" to the model within data.table might be problematic because of the way .SD works as an environment.

See "Why is using update on a lm inside a grouped data.table losing its model data?" for an example of what can occur. This is the subject of bug #2590.

mnel answered Feb 15 '23 20:02