This question pertains to creating "wide" tables, similar to those you could create using dcast from reshape2. I know this has been discussed many times before, but my question is about how to make the process more efficient. I have provided several examples below, which might make the question seem lengthy, but most of it is just test code for benchmarking.
Starting with a simple example,
> z <- data.table(col1=c(1,1,2,3,4), col2=c(10,10,20,20,30),
                  col3=c(5,2,2.3,2.4,100), col4=c("a","a","b","c","a"))
> z
   col1 col2  col3 col4
1:    1   10   5.0    a    # col1 = 1, col2 = 10
2:    1   10   2.0    a    # col1 = 1, col2 = 10
3:    2   20   2.3    b
4:    3   20   2.4    c
5:    4   30 100.0    a
We need to create a "wide" table that has the values of the col4 column as column names and the values of sum(col3) for each combination of col1 and col2.
> ulist = unique(z$col4)   # these will become the additional column names
# Create the long table with sums
> z2 <- z[, list(sumcol = sum(col3)), by='col1,col2,col4']
# Pivot the long table
> z2 <- z2[, as.list(sumcol[match(ulist, col4)]), by=c("col1","col2")]
# Add column names (setnames works by reference, so pass z2 itself, not z2[])
> setnames(z2, c("col1", "col2", ulist))
> z2
   col1 col2   a   b   c
1:    1   10   7  NA  NA   # a = 5.0 + 2.0 = 7, corresponding to col1=1, col2=10
2:    2   20  NA 2.3  NA
3:    3   20  NA  NA 2.4
4:    4   30 100  NA  NA
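The pivot step above hinges on match() lining each group's sums up against the full set of target column names. A minimal standalone illustration of that mechanic, using toy vectors rather than the table above:

```r
# For each target column name, match() returns the position of that value
# within the group's col4 vector, or NA when the group has no such row.
ulist  <- c("a", "b", "c")   # all unique values of the pivot column
col4   <- c("a")             # col4 values present in one group (col1=1, col2=10)
sumcol <- c(7)               # the group's summed col3, one entry per col4 value

idx <- match(ulist, col4)
print(idx)                   # 1 NA NA
print(sumcol[idx])           # 7 NA NA -> becomes columns a, b, c for this group
```

Indexing a vector with NA positions yields NA values, which is exactly where the NA cells in the wide table come from.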
The issue I have is that while the above method is fine for smaller tables, it's virtually impossible to run on very large tables (unless you are willing to wait x hours, perhaps).
This, I believe, is because the pivoted/wide table is much larger than the original: each row in the wide table has n columns corresponding to the unique values of the pivot column, regardless of whether any value corresponds to a given cell (these are the NA values above). The wide table is therefore often 2x+ the size of the original "long" table.
My original table has ~500 million rows and about 20 unique values in the pivot column. I have attempted to run the above using only 5 million rows, and it takes forever in R (too long to wait for it to complete).
For benchmarking purposes, the example (using 5 million rows) completes in about 1 minute using production RDBMS systems running multithreaded. It completes in about 8 seconds on a single core using KDB+/Q (http://www.kx.com). It might not be a fair comparison, but it gives a sense that it is possible to do these operations much faster by alternative means. KDB+ doesn't have sparse rows, so it is allocating memory for all the cells and is still much faster than anything else I have tried.
What I need, however, is an R solution :) and so far, I haven't found an efficient way to perform similar operations.
If you have had experience with this and could suggest an alternative / more optimal solution, I'd be interested to hear it. Sample code is provided below; you can vary the value of n to simulate the results. The number of unique values for the pivot column (column c3) has been fixed at 25.
n = 100   # increase this to benchmark
z <- data.table(c1 = sample(1:10000, n, replace = T),
                c2 = sample(1:100000, n, replace = T),
                c3 = sample(1:25, n, replace = T),
                price = runif(n) * 10)
c3.unique <- 1:25
z <- z[, list(sumprice = sum(price)), by = 'c1,c2,c3'][
       , as.list(sumprice[match(c3.unique, c3)]), by = 'c1,c2']
setnames(z, c("c1", "c2", c3.unique))
Thanks,
For n = 1e6, the following takes about 10 seconds with plain dcast and about 4 seconds with dcast.data.table:
library(reshape2)
dcast(z[, sum(price), by = list(c1, c2, c3)], c1 + c2 ~ c3)

# or, with data.table 1.8.11:
dcast.data.table(z, c1 + c2 ~ c3, fun = sum)
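As a side note: in more recent data.table versions (1.9.6 and later, to my understanding), dcast is an S3 generic with a native data.table method, so the reshape2 detour is no longer needed. A sketch of the equivalent call, assuming a recent data.table (fill = NA keeps absent combinations as NA, matching the match() approach; the default fill with fun.aggregate = sum would be 0):

```r
library(data.table)

# Build a small test table like the one in the question.
n <- 100
z <- data.table(c1 = sample(1:10000, n, replace = TRUE),
                c2 = sample(1:100000, n, replace = TRUE),
                c3 = sample(1:25, n, replace = TRUE),
                price = runif(n) * 10)

# Pivot directly: sum price within each c1/c2/c3 cell, one column per c3 value.
w <- dcast(z, c1 + c2 ~ c3, value.var = "price",
           fun.aggregate = sum, fill = NA)
```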