
Proper/fastest way to reshape a data.table

Tags:

r

data.table

I have a data table in R:

    library(data.table)
    set.seed(1234)
    DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
    DT
          x y  v
     [1,] 1 A 12
     [2,] 1 B 62
     [3,] 1 A 60
     [4,] 1 B 61
     [5,] 2 A 83
     [6,] 2 B 97
     [7,] 2 A  1
     [8,] 2 B 22
     [9,] 3 A 99
    [10,] 3 B 47
    [11,] 3 A 63
    [12,] 3 B 49

I can easily sum the variable v by the groups in the data.table:

    out <- DT[,list(SUM=sum(v)),by=list(x,y)]
    out
         x y SUM
    [1,] 1 A  72
    [2,] 1 B 123
    [3,] 2 A  84
    [4,] 2 B 119
    [5,] 3 A 162
    [6,] 3 B  96

However, I would like to have the groups (y) as columns, rather than rows. I can accomplish this using reshape:

    out <- reshape(out,direction='wide',idvar='x', timevar='y')
    out
         x SUM.A SUM.B
    [1,] 1    72   123
    [2,] 2    84   119
    [3,] 3   162    96

Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?

835

asked Aug 01 '11 17:08

Zach

2 Answers

The data.table package implements faster melt/dcast functions (in C). It also adds features, such as melting and casting multiple columns at once. Please see the new Efficient reshaping using data.tables vignette on GitHub.

melt/dcast functions for data.table have been available since v1.9.0 and the features include:

- There is no need to load the reshape2 package prior to casting. But if you want it loaded for other operations, please load it before loading data.table.
- dcast is also an S3 generic. No more dcast.data.table(). Just use dcast().
- melt:
  - is capable of melting on columns of type 'list'.
  - gains variable.factor and value.factor, which by default are TRUE and FALSE respectively for compatibility with reshape2. This allows for directly controlling the output type of the variable and value columns (as factors or not).
  - melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.
  - NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list will be combined together. This is facilitated further through the use of patterns(). See vignette or ?melt.
- dcast:
  - accepts multiple fun.aggregate and multiple value.var. See vignette or ?dcast.
  - use the rowid() function directly in the formula to generate an id-column, which is sometimes required to identify the rows uniquely. See ?dcast.
- Old benchmarks:
  - melt: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
  - dcast: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
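Applied to the table from the question, the dcast features listed above might be used as in the following minimal sketch. The v values are hard-coded to match the printed table in the question (an assumption, so that the result does not depend on how a given R version implements sample()), and y is spelled out with rep() instead of relying on recycling.

```r
library(data.table)

# Same layout as the question's table; v is typed in by hand (assumption)
# so the sums below match the question's printed output on any R version.
DT <- data.table(x = rep(c(1, 2, 3), each = 4),
                 y = rep(c("A", "B"), times = 6),
                 v = c(12, 62, 60, 61, 83, 97, 1, 22, 99, 47, 63, 49))

# Aggregate and reshape in one step: one row per x, one column per level of y.
wide <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)
print(wide)

# Several aggregation functions at once; see ?dcast for how the
# resulting columns are named.
wide2 <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = list(sum, mean))
print(wide2)
```

The first call returns the same sums as the aggregate-then-reshape approach in the question, without the intermediate long table.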

Reminder of Cologne (Dec 2013) presentation slide 32 : Why not submit a dcast pull request to reshape2?
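For the melt feature mentioned in the list (a list of measure.vars built with patterns()), a small sketch could look like this. The table and its column names (a_1, a_2, b_1, b_2) are made up for illustration and do not come from the question.

```r
library(data.table)

# Hypothetical wide table: two measurements (a and b), each taken twice.
wide <- data.table(id  = 1:3,
                   a_1 = c(1, 2, 3),       a_2 = c(4, 5, 6),
                   b_1 = c("u", "v", "w"), b_2 = c("x", "y", "z"))

# Each pattern selects one group of columns; each group is melted into
# its own value column, named via value.name.
long <- melt(wide, id.vars = "id",
             measure.vars = patterns("^a_", "^b_"),
             value.name = c("a", "b"))
print(long)
```

Because the a and b columns are melted into separate value columns, the numeric and character measurements keep their own types.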

179

answered Sep 24 '22 14:09

Zach

This feature is now implemented in data.table (from version 1.8.11 on), as can be seen in Zach's answer above.

I just saw this great chunk of code from Arun here on SO. So I guess there is a data.table solution. Applied to this problem:

    library(data.table)
    set.seed(1234)
    DT <- data.table(x=rep(c(1,2,3),each=1e6),
                     y=c("A","B"),
                     v=sample(1:100,12))

    out <- DT[,list(SUM=sum(v)),by=list(x,y)]
    # edit (mnel) to avoid setNames which creates a copy
    # when calling `names<-` inside the function
    out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
       x        A        B
    1: 1 26499966 28166677
    2: 2 26499978 28166673
    3: 3 26500056 28166650

This gives the same results as DWin's approach:

    tapply(DT$v,list(DT$x, DT$y), FUN=sum)
             A        B
    1 26499966 28166677
    2 26499978 28166673
    3 26500056 28166650

Also, it is fast:

    system.time({
        out <- DT[,list(SUM=sum(v)),by=list(x,y)]
        out[, as.list(setattr(SUM, 'names', y)), by=list(x)]})
    ##  user  system elapsed
    ## 0.64    0.05    0.70
    system.time(tapply(DT$v,list(DT$x, DT$y), FUN=sum))
    ## user  system elapsed
    ## 7.23    0.16    7.39

UPDATE

So that this solution also works for non-balanced data sets (i.e. some combinations do not exist), you have to enter those in the data table first:

    library(data.table)
    set.seed(1234)
    DT <- data.table(x=c(rep(c(1,2,3),each=4),3,4), y=c("A","B"), v=sample(1:100,14))

    out <- DT[,list(SUM=sum(v)),by=list(x,y)]
    setkey(out, x, y)

    intDT <- expand.grid(unique(out[,x]), unique(out[,y]))
    setnames(intDT, c("x", "y"))
    out <- out[intDT]

    out[, as.list(setattr(SUM, 'names', y)), by=list(x)]
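To see what the join against the full grid does, here is a small sketch on a made-up three-row table (the values are illustrative only). It uses CJ() in place of expand.grid() and setNames() in place of setattr(), which is a simplification rather than the exact code above.

```r
library(data.table)

# Tiny unbalanced table (made-up values): x = 2 has no "A" row.
out <- data.table(x = c(1, 1, 2), y = c("A", "B", "B"), SUM = c(10, 20, 30))
setkey(out, x, y)

# CJ() builds every x/y combination; joining on it adds an NA row for (2, "A").
grid <- CJ(x = unique(out$x), y = unique(out$y))
full <- out[grid]
print(full)

# Every x group now has exactly one value per y, so the columns line up.
wide <- full[, setNames(as.list(SUM), y), by = x]
print(wide)
```

Without the join, the x = 2 group would contribute only a single value, and the per-group lists would no longer have matching names and lengths.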

Summary

Combining the comments with the above, here's the 1-line solution:

    DT[, sum(v), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
       setNames(as.list(V1), paste(y)), by = x]

It's also easy to modify this to have more than just the sum, e.g.:

    DT[, list(sum(v), mean(v)), keyby = list(x,y)][CJ(unique(x), unique(y)), allow.cartesian = T][,
       setNames(as.list(c(V1, V2)), c(paste0(y,".sum"), paste0(y,".mean"))), by = x]
    #   x A.sum B.sum   A.mean B.mean
    #1: 1    72   123 36.00000   61.5
    #2: 2    84   119 42.00000   59.5
    #3: 3   187    96 62.33333   48.0
    #4: 4    NA    81       NA   81.0

answered Sep 22 '22 14:09

Christoph_J
