
Efficient ways to reshape huge data from long to wide format - similar to dcast

This question pertains to creating "wide" tables similar to tables you could create using dcast from reshape2. I know this has been discussed many times before, but my question pertains to how to make the process more efficient. I have provided several examples below which might make the question seem lengthy, but most of it is just test code for benchmarking.

Starting with a simple example,

> z <- data.table(col1=c(1,1,2,3,4), col2=c(10,10,20,20,30),
                  col3=c(5,2,2.3,2.4,100), col4=c("a","a","b","c","a"))

> z
   col1 col2  col3 col4
1:    1   10   5.0    a      # col1 = 1, col2 = 10
2:    1   10   2.0    a      # col1 = 1, col2 = 10
3:    2   20   2.3    b
4:    3   20   2.4    c
5:    4   30 100.0    a

We need to create a "wide" table that will have the values of the col4 column as column names and the value of the sum(col3) for each combination of col1 and col2.

> ulist = unique(z$col4) # These will be the additional column names

# Create long table with sum
> z2 <- z[,list(sumcol=sum(col3)), by='col1,col2,col4']

# Pivot the long table
> z2 <- z2[,as.list((sumcol[match(ulist,col4)])), by=c("col1","col2")]

# Add column names
> setnames(z2[],c("col1","col2",ulist))

> z2
   col1 col2   a   b   c
1:    1   10   7  NA  NA  # a = 5.0 + 2.0 = 7 corresponding to col1=1, col2=10
2:    2   20  NA 2.3  NA
3:    3   20  NA  NA 2.4
4:    4   30 100  NA  NA
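The key step above is the match() lookup: for each group it places that group's sums under the right output columns and leaves NA everywhere else. A minimal standalone illustration of just that mechanic (the values here are taken from the first group above):

```r
# Illustration of the match()-based pivot used above:
# for one (col1, col2) group, place each sum under the column
# named after its col4 value, leaving NA elsewhere.
ulist  <- c("a", "b", "c")   # all possible pivot values
col4   <- c("a")             # pivot values present in this group
sumcol <- c(7)               # sums for those values

# match() finds where each ulist entry occurs in col4 (NA if absent),
# so indexing sumcol by it yields one value per output column.
row <- sumcol[match(ulist, col4)]
names(row) <- ulist
row
#  a  b  c
#  7 NA NA
```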

The issue I have is that while the above method is fine for smaller tables, it's virtually impossible to run on very large tables (unless you are fine with waiting x hours, maybe).

This, I believe, is likely related to the fact that the pivoted/wide table is much larger than the original table, since each row in the wide table has n columns corresponding to the unique values of the pivot column, regardless of whether any value corresponds to that cell (these are the NA values above). The size of the new table is therefore often 2x+ that of the original "long" table.
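To make the blow-up concrete, here is a rough back-of-envelope cell count (a sketch; the row counts and the 25 distinct pivot values are illustrative, not measured):

```r
# Rough size estimate: a long table of aggregated rows becomes a
# wide table of (distinct key pairs) x (2 + k) cells, where k is
# the number of unique pivot values.
long_rows <- 5e6   # aggregated long rows (illustrative)
key_pairs <- 2e6   # distinct (col1, col2) pairs (illustrative)
k         <- 25    # unique values in the pivot column

long_cells <- long_rows * 4         # col1, col2, col4, sumcol
wide_cells <- key_pairs * (2 + k)   # keys plus one column per pivot value

wide_cells / long_cells             # ~2.7x the long table
```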

My original table has ~500 million rows and about 20 unique values in the pivot column. I have attempted to run the above using only 5 million rows, and it takes forever in R (too long to wait for it to complete).

For benchmarking purposes, the example (using 5 million rows) completes in about 1 minute on production RDBMS systems running multithreaded, and in about 8 "seconds" on a single core using KDB+/Q (http://www.kx.com). It might not be a fair comparison, but it gives a sense that these operations can be done much faster by alternative means. KDB+ doesn't have sparse rows, so it allocates memory for all the cells and is still much faster than anything else I have tried.

What I need, however, is an R solution :) and so far I haven't found an efficient way to perform similar operations.

If you have experience with any alternative / more optimal solution, I'd be interested to hear about it. Sample code is provided below. You can vary the value of n to simulate the results. The number of unique values for the pivot column (c3) has been fixed at 25.

n = 100 # Increase this to benchmark

z <- data.table(c1=sample(1:10000,n,replace=T),
    c2=sample(1:100000,n,replace=T),
    c3=sample(1:25,n,replace=T),
    price=runif(n)*10)

c3.unique <- 1:25

z <- z[,list(sumprice=sum(price)), by='c1,c2,c3'][,as.list((sumprice[match(c3.unique,c3)])), by='c1,c2']
setnames(z[], c("c1","c2",c3.unique))
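For anyone reproducing the timings, the pipeline above can be wrapped in system.time() (a sketch; n is kept small here so it finishes quickly, and the random data means only the timing, not the values, is meaningful):

```r
library(data.table)

n <- 1e5  # small n for a quick timing run; increase to stress-test

z <- data.table(c1=sample(1:10000, n, replace=TRUE),
                c2=sample(1:100000, n, replace=TRUE),
                c3=sample(1:25, n, replace=TRUE),
                price=runif(n)*10)
c3.unique <- 1:25

# Time the aggregate-then-pivot pipeline from the question
system.time({
  w <- z[, list(sumprice=sum(price)), by='c1,c2,c3'][
         , as.list(sumprice[match(c3.unique, c3)]), by='c1,c2']
  setnames(w, c("c1", "c2", c3.unique))
})
```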

Thanks,

  • Raj.
asked Sep 14 '13 by xbsd


1 Answer

For n=1e6 the following takes about 10 seconds with plain dcast and about 4 seconds with dcast.data.table:

library(reshape2)

dcast(z[, sum(price), by = list(c1, c2, c3)], c1 + c2 ~ c3)

# or with 1.8.11
dcast.data.table(z, c1 + c2 ~ c3, fun = sum)
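On current data.table versions, dcast.data.table has since become the S3 method data.table::dcast, so an equivalent call today looks like this (a sketch, assuming a recent data.table; fill = NA keeps absent cells as NA to match the question's output):

```r
library(data.table)

# Reproduce the question's setup on a small random sample
n <- 1000
z <- data.table(c1=sample(1:100, n, replace=TRUE),
                c2=sample(1:1000, n, replace=TRUE),
                c3=sample(1:25, n, replace=TRUE),
                price=runif(n)*10)

# dcast aggregates and pivots in one step
w <- dcast(z, c1 + c2 ~ c3, value.var = "price",
           fun.aggregate = sum, fill = NA)
```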
answered Sep 30 '22 by eddi