Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to group data.table by multiple columns?

I'm using the data.table package to speed up some summary statistic collection on a data set.

I'm curious if there's a way to group by more than one column. My data looks like this:

  purchaseAmt        adShown        url
   15.54            00001         150000001
    4.82            00002         150000001
  157.99            05005         776300044
   ...               ...            ...

I can do something like this:

adShownMedian <- df1[,median(purchaseAmt),by="adShown"]

to get each ad's median. How would I do something that combines adShown and url?

I've tried this:

adShownMedian <- df1[,median(purchaseAmt),by=c("adShown","url")]

but no luck.

Any suggestions?

like image 323
screechOwl Avatar asked Sep 18 '12 14:09

screechOwl


People also ask

How do I group multiple columns?

A shortcut way to group rows or columns is to highlight the rows/columns you wish to group and use ALT+SHIFT+RIGHT ARROW to group the rows/columns, and ALT+SHIFT+LEFT ARROW to ungroup them. You can go multiple levels as well (so you could group rows 1-30, and then group rows 20-25 as a subgroup of the first).

How do you group columns in a table?

Select the data (including any summary rows or columns). On the Data tab, in the Outline group, click Group > Group Rows or Group Columns. Optionally, if you want to outline an inner, nested group — select the rows or columns within the outlined data range, and repeat step 3.

What is .n in data table?

table's . N symbol, where . N stands for “number of rows.” It can be the total number of rows, or number of rows per group if you're aggregating in the “by” section. This expression returns the total number of rows in the data.table: mydt[, . N]


Video Answer


2 Answers

Use by=list(adShown,url) instead of by=c("adShown","url")

Example:

set.seed(007)  DF <- data.frame(X=1:20, Y=sample(c(0,1), 20, TRUE), Z=sample(0:5, 20, TRUE))  library(data.table) DT <- data.table(DF) DT[, Mean:=mean(X), by=list(Y, Z)]        X Y Z      Mean  1:  1 1 3  1.000000  2:  2 0 1  9.333333  3:  3 0 5  7.400000  4:  4 0 5  7.400000  5:  5 0 5  7.400000  6:  6 1 0  6.000000  7:  7 0 3  7.000000  8:  8 1 2 12.500000  9:  9 0 5  7.400000 10: 10 0 2 15.000000 11: 11 0 4 14.500000 12: 12 0 1  9.333333 13: 13 1 1 13.000000 14: 14 0 1  9.333333 15: 15 0 2 15.000000 16: 16 0 5  7.400000 17: 17 1 2 12.500000 18: 18 0 4 14.500000 19: 19 1 5 19.000000 20: 20 0 2 15.000000 
like image 62
Jilber Urbina Avatar answered Sep 27 '22 19:09

Jilber Urbina


To add on Jilber Urbina answer, and address kahlo comment:
if you want to get a single row for each Y - Z combination with the aggregated values you can do

DT[, .(X=mean(X)), by=list(Y, Z)]

that is the same as doing

DT[, .(X=mean(X)), by=.(Y, Z)] 
# or
DT[, .(X=mean(X)), by=c('Y','Z')]
# or specify column names in vector
names = c('Y','Z')
DT[, .(X=mean(X)), by=names] 

(data.table version 1.12.6)

like image 36
Fernando.- Avatar answered Sep 27 '22 18:09

Fernando.-