Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regression and summary statistics by group within a data.table

I would like to calculate some summary statistics and perform different regressions by group within a data table, and have the results in "wide" format (i.e. one row per group with several columns). I can do it in multiple steps, but it seems like it should be possible to do all at once.

Consider this example data:

set.seed=46984
dt <- data.table(ID=c(rep('Frank',5),rep('Tony',5),rep('Ed',5)), y=rnorm(15), x=rnorm(15), z=rnorm(15),key="ID")
dt
#       ID          y          x            z
# 1:    Ed  0.2129400 -0.3024061  0.845335632
# 2:    Ed  0.4850342 -0.5159197 -0.087965415
# 3:    Ed  1.8917489  1.7803220  0.760465271
# 4:    Ed -0.4330460 -2.1720944  0.973812545
# 5:    Ed  0.7685060  0.7947470  1.279761200
# 6: Frank  0.4978475 -0.2906851  0.568101004
# 7: Frank  0.6323386 -0.5596599  1.537133025
# 8: Frank -0.8243218 -0.4354885  0.057818033
# 9: Frank  1.2402488  0.3229422  0.005995249
#10: Frank  0.2436210 -0.2651422  0.349532173
#11:  Tony  0.4179568  0.1418463  0.142380549
#12:  Tony  0.7036613  0.4402572  0.141237901
#13:  Tony -0.1978720 -0.9553784  0.480425820
#14:  Tony -1.7269375 -0.1881292  0.370583351
#15:  Tony  1.1064903  0.4375014 -0.798221750

Let's say I want to get the median by ID, perform linear regression on y ~ x by ID, and perform linear regression on y ~ x + z by ID. Here I get the median:

dt.med <- dt[,list(y.med=median(y)),by=ID]
dt.med
#      ID     y.med
#1:    Ed 0.4850342
#2: Frank 0.4978475
#3:  Tony 0.4179568

And thanks to this answer by @DWin, here I get the two individual sets of regression coefficients as columns by ID:

dt.reg.1 <- dt[,as.list(coef(lm(y ~ x))), by=ID]
dt.reg.1
#      ID (Intercept)         x
#1:    Ed  0.63057884 0.5482373
#2: Frank  0.69720351 1.3813007
#3:  Tony  0.08588421 1.0179131

dt.reg.2 <- dt[,as.list(coef(lm(y ~ x + z))), by=ID]
dt.reg.2
#      ID (Intercept)         x          z
#1:    Ed   0.8262577 0.5587170 -0.2582699
#2: Frank   0.4317538 2.7221024  1.1807442
#3:  Tony   0.1494439 0.3166547 -1.2029693

Now I have to join the three result sets, and rename the columns:

dt.ans <- dt.med[dt.reg.1][dt.reg.2]
setnames(dt.ans,c("ID","y.med","reg.1.c0","reg.1.c1","reg.2.c0","reg.2.c1","reg.2.c2"))

Finally, here is the desired output for the example:

dt.ans
#      ID     y.med   reg.1.c0  reg.1.c1  reg.2.c0  reg.2.c1   reg.2.c2
#1:    Ed 0.4850342 0.63057884 0.5482373 0.8262577 0.5587170 -0.2582699
#2: Frank 0.4978475 0.69720351 1.3813007 0.4317538 2.7221024  1.1807442
#3:  Tony 0.4179568 0.08588421 1.0179131 0.1494439 0.3166547 -1.2029693

It seems inefficient to calculate the three results, join them, and then rename the columns. Also, my actual tables are largish so I'd like to make sure I don't use too much system memory. Is it possible to do this all within "one" data.table statement? Or more generally, can this be done more efficiently?

I've tried different things. Here is one failed example that gives the median but ignores the regression coefficients:

dt[,as.list(median(y),coef(lm(y ~ x))), by=ID]
#      ID        V1
#1:    Ed 0.4850342
#2: Frank 0.4978475
#3:  Tony 0.4179568
like image 612
dnlbrky Avatar asked Oct 22 '13 16:10

dnlbrky


People also ask

What is a table of summary statistics?

The summary table is a visualization that summarizes statistical information about data in table form. The information is based on one data table in TIBCO Spotfire. You can, at any time, choose which measures you want to see (such as mean, median, etc.), as well as the columns on which to base these measures.


1 Answers

dt[,c(y.med = median(y),
      reg.1 = as.list(coef(lm(y ~ x))),
      reg.2 = as.list(coef(lm(y ~ x + z)))), by=ID]
#      ID     y.med reg.1.(Intercept)   reg.1.x reg.2.(Intercept)      reg.2.x   reg.2.z
#1:    Ed 0.7280448        0.75977555 0.1132509        0.83322290 -0.484348116 0.7655563
#2: Frank 0.6100339       -0.07830664 0.2700846        0.04720686  0.004027939 0.7168521
#3:  Tony 0.2710623       -0.78319379 0.9166601       -0.35836990  0.622822617 0.4161102
like image 164
eddi Avatar answered Nov 03 '22 23:11

eddi