aggregation with data.table in R

Tags:

r

data.table

The exercise consists of aggregating a numeric vector of values by a combination of factors with data.table in R. Take the following data table as an example:

require (data.table)
require (plyr)
dtb <- data.table (cbind (expand.grid (month = rep (month.abb[1:3], each = 3),
                                       fac = letters[1:3]),
                          value = rnorm (27)))

Notice that every unique combination of 'month' and 'fac' shows up three times. So, when I try to average values by both these factors, I should expect a data frame with 9 unique rows:

(agg1 <- ddply (dtb, c ("month", "fac"), function (dfr) mean (dfr$value)))
  month fac          V1
1   Jan   a -0.36030953
2   Jan   b -0.58444588
3   Jan   c -0.15472876
4   Feb   a -0.05674483
5   Feb   b  0.26415972
6   Feb   c -1.62346772
7   Mar   a  0.24560510
8   Mar   b  0.82548140
9   Mar   c  0.18721114

However, when aggregating with data.table, I keep getting the results provided by every redundant combination of the two factors:

(agg2 <- dtb[, value := mean (value), by = list (month, fac)])
    month fac       value
 1:   Jan   a -0.36030953
 2:   Jan   a -0.36030953
 3:   Jan   a -0.36030953
 4:   Feb   a -0.05674483
 5:   Feb   a -0.05674483
 6:   Feb   a -0.05674483
 7:   Mar   a  0.24560510
 8:   Mar   a  0.24560510
 9:   Mar   a  0.24560510
10:   Jan   b -0.58444588
11:   Jan   b -0.58444588
12:   Jan   b -0.58444588
13:   Feb   b  0.26415972
14:   Feb   b  0.26415972
15:   Feb   b  0.26415972
16:   Mar   b  0.82548140
17:   Mar   b  0.82548140
18:   Mar   b  0.82548140
19:   Jan   c -0.15472876
20:   Jan   c -0.15472876
21:   Jan   c -0.15472876
22:   Feb   c -1.62346772
23:   Feb   c -1.62346772
24:   Feb   c -1.62346772
25:   Mar   c  0.18721114
26:   Mar   c  0.18721114
27:   Mar   c  0.18721114
    month fac       value

Is there an elegant way to collapse these results to one row per unique combination of factors with data.table?

Gil Tomás asked Mar 05 '13 19:03


2 Answers

The issue (and the reasoning behind it) is that the aggregated value is being assigned, not just calculated.

It is easier to observe this in action if you look at a data.table with more columns than just the ones being used for the computation.

# Therefore, let's add a new column
# (LETTERS has only 26 entries, so row 27 gets NA)
dtb[, newCol := LETTERS[seq(length(value))]]

Note that if we just want to output the computed value, then the expression on the RHS as you have it is just fine.

# This gives the expected results
dtb[, mean (value), by = list (month, fac)]

# This on the other hand assigns the respective values to *each* row
dtb[, value := mean (value), by = list (month, fac)]

In other words, without :=, the result is reduced to one row per unique combination of month and fac.
However, if you want to save this value back into the SAME data table (which is what the := operator does), then every row selected in i (all rows by default) is assigned a value. This makes sense when you look at the output with the additional column.

Assigning the result of that call to agg2 then still carries over all 27 rows.
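A minimal sketch of why the assigned result still has every row, assuming a recent data.table version. The seed is arbitrary, and `address()` is data.table's helper that reports where an object lives in memory:

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility
dtb <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))

# := updates dtb in place and returns it (invisibly),
# so agg2 is bound to the very same table, not to a summary
agg2 <- dtb[, value := mean(value), by = list(month, fac)]

nrow(agg2)                     # still one row per original row
address(agg2) == address(dtb)  # check whether both names point at one object
```

If the two addresses match, later `:=` calls on either name change both, which is why the longer example in this answer restores the data with `copy()`.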

Therefore, if you want the new table to contain only the unique rows of the original, you can

a.  wrap the original table inside `unique()` before assigning it
b.  assign the table shown above, the one returned when you compute
    the RHS without assigning it (which is what @Arun suggested)

An example of a. would be:

 agg2 <- unique(dtb[, value := mean (value), by = list (month, fac)])
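Option a. depends on which columns `unique()` compares. The sketch below assumes a recent data.table version, where `unique()` on a data.table compares all columns by default and accepts a `by` argument to restrict the comparison. The `rows_all` and `rows_grp` names and the seed are made up for illustration:

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility
dtb <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))

# an extra column that is not part of the grouping
dtb[, newCol := rep(c("A", "B", "A"), length.out = .N)]
dtb[, value := mean(value), by = list(month, fac)]

rows_all <- unique(dtb)                          # compares every column
rows_grp <- unique(dtb, by = c("month", "fac"))  # compares grouping columns only

nrow(rows_all)
nrow(rows_grp)
```

With only `month`, `fac` and `value` in the table the two calls agree. Once other columns vary within a group, only the `by` form collapses to one row per combination, and it keeps the first row of each group.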

The following example might help illustrate.

(You would need to copy + paste this, as the output is omitted)

  # SAMPLE DATA, as above
  library(data.table)
  dtb.bak <- data.table (expand.grid (month = rep (month.abb[1:3], each = 3), fac = letters[1:3]), value = rnorm (27))

  #  METHOD 1  # 
  #------------#
  dtb <- copy(dtb.bak)  # restore, from sample data.


  dtb[, value := mean (value), by = list (month, fac)]
  dtb

  # this is what you would like to assign
  unique(dtb)


  #  METHOD 2  # 
  #------------#
  dtb <- copy(dtb.bak)  # restore, from sample data.

  # this is what you would like to assign
  # next two lines are the same, only difference is column name
  dtb[, mean (value), by = list (month, fac)]
  dtb[, list("mean" = mean (value)), by = list (month, fac)]  # quote marks added for clarity

  # dtb is unchanged. 
  dtb



  # NOW COMPARE THE SAME TWO METHODS, BUT IF THERE IS AN ADDITIONAL COLUMN
  dtb.bak[, newCol := rep(c("A", "B", "A"), length(value)/3)]


  dtb1 <- copy(dtb.bak)  # restore, from sample data.
  dtb2 <- copy(dtb.bak)  # restore, from sample data.


  # Method 1
  dtb1[, value := mean (value), by = list (month, fac)]
  dtb1
  unique(dtb1)

  #  METHOD 2  # 
  dtb2[, list("mean" = mean (value)), by = list (month, fac)]  # quote marks added for clarity
  dtb2

  # METHOD 2, WITH ADDED COLUMNS IN list() in `j`
  dtb2[, list("mean" = mean (value), newCol), by = list (month, fac)]  # quote marks added for clarity
  # compare this output with
  unique(dtb1)
Ricardo Saporta answered Nov 01 '22 07:11


You should do:

agg2 <- dtb[, list(value = mean(value)), by = list (month, fac)]

:= recycles the RHS to fit the number of rows on the LHS, so a single mean per group is repeated for every row in that group. Do ?':=' to read more about this.
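A small sketch of the difference, assuming the same sample data as in the question. The column name `grp_mean` and the seed are made up for illustration:

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility
dtb <- data.table(expand.grid(month = rep(month.abb[1:3], each = 3),
                              fac = letters[1:3]),
                  value = rnorm(27))

# list() in j: returns a new table with one summary row per group,
# dtb itself is left untouched
agg2 <- dtb[, list(value = mean(value)), by = list(month, fac)]

# := in j: the single mean of each group is recycled across that
# group's rows and stored as a new column of dtb
dtb[, grp_mean := mean(value), by = list(month, fac)]

nrow(agg2)  # 9
nrow(dtb)   # 27
```

Writing the recycled mean to a separate column rather than over `value` keeps the original values available for further aggregation.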

Arun answered Nov 01 '22 07:11
