This question pertains to creating "wide" tables, similar to those you could create using dcast from reshape2. I know this has been discussed many times before, but my question is about how to make the process more efficient. I have provided several examples below, which might make the question seem lengthy, but most of it is just test code for benchmarking.
Starting with a simple example,
> z <- data.table(col1=c(1,1,2,3,4), col2=c(10,10,20,20,30),
                  col3=c(5,2,2.3,2.4,100), col4=c("a","a","b","c","a"))
> z
   col1 col2  col3 col4
1:    1   10   5.0    a    # col1 = 1, col2 = 10
2:    1   10   2.0    a    # col1 = 1, col2 = 10
3:    2   20   2.3    b
4:    3   20   2.4    c
5:    4   30 100.0    a
We need to create a "wide" table that has the values of the col4 column as column names and the values of sum(col3) for each combination of col1 and col2.
> ulist = unique(z$col4)   # these will become the additional column names
# Create the long table with sums
> z2 <- z[, list(sumcol = sum(col3)), by='col1,col2,col4']
# Pivot the long table
> z2 <- z2[, as.list(sumcol[match(ulist, col4)]), by=c("col1","col2")]
# Add column names (setnames works by reference, so pass z2 itself, not z2[])
> setnames(z2, c("col1", "col2", ulist))
> z2
   col1 col2   a   b   c
1:    1   10   7  NA  NA   # a = 5.0 + 2.0 = 7, corresponding to col1=1, col2=10
2:    2   20  NA 2.3  NA
3:    3   20  NA  NA 2.4
4:    4   30 100  NA  NA
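The pivot step above hinges on match() lining each group's sums up against the full set of target column names. A minimal standalone illustration of that mechanic, using toy vectors rather than the table above:

```r
# For each target column name, match() returns the position of that value
# within the group's col4 vector, or NA when the group has no such row.
ulist  <- c("a", "b", "c")   # all unique values of the pivot column
col4   <- c("a")             # col4 values present in one group (col1=1, col2=10)
sumcol <- c(7)               # the group's summed col3, one entry per col4 value

idx <- match(ulist, col4)
print(idx)                   # 1 NA NA
print(sumcol[idx])           # 7 NA NA -> becomes columns a, b, c for this group
```

Indexing a vector with NA positions yields NA values, which is exactly where the NA cells in the wide table come from.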
The issue I have is that while the above method is fine for smaller tables, it's virtually impossible to run on very large tables (unless you are willing to wait x hours, perhaps).
This, I believe, is because the pivoted/wide table is much larger than the original: each row in the wide table has n columns corresponding to the unique values of the pivot column, regardless of whether any value corresponds to a given cell (these are the NA values above). The wide table is therefore often 2x+ the size of the original "long" table.
My original table has ~500 million rows and about 20 unique values in the pivot column. I have attempted to run the above using only 5 million rows, and it takes forever in R (too long to wait for it to complete).
For benchmarking purposes, the example (using 5 million rows) completes in about 1 minute using production RDBMS systems running multithreaded. It completes in about 8 seconds on a single core using KDB+/Q (http://www.kx.com). It might not be a fair comparison, but it gives a sense that it is possible to do these operations much faster by alternative means. KDB+ doesn't have sparse rows, so it is allocating memory for all the cells and is still much faster than anything else I have tried.
What I need, however, is an R solution :) and so far, I haven't found an efficient way to perform similar operations.
If you have had experience with this and could suggest an alternative / more optimal solution, I'd be interested to hear it. Sample code is provided below; you can vary the value of n to simulate the results. The number of unique values for the pivot column (column c3) has been fixed at 25.
n = 100   # increase this to benchmark
z <- data.table(c1 = sample(1:10000, n, replace = T),
                c2 = sample(1:100000, n, replace = T),
                c3 = sample(1:25, n, replace = T),
                price = runif(n) * 10)
c3.unique <- 1:25
z <- z[, list(sumprice = sum(price)), by = 'c1,c2,c3'][
       , as.list(sumprice[match(c3.unique, c3)]), by = 'c1,c2']
setnames(z, c("c1", "c2", c3.unique))
Thanks,
For n = 1e6, the following takes about 10 seconds with plain dcast and about 4 seconds with dcast.data.table:
library(reshape2)
dcast(z[, sum(price), by = list(c1, c2, c3)], c1 + c2 ~ c3)

# or, with data.table 1.8.11:
dcast.data.table(z, c1 + c2 ~ c3, fun = sum)
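As a side note: in more recent data.table versions (1.9.6 and later, to my understanding), dcast is an S3 generic with a native data.table method, so the reshape2 detour is no longer needed. A sketch of the equivalent call, assuming a recent data.table (fill = NA keeps absent combinations as NA, matching the match() approach; the default fill with fun.aggregate = sum would be 0):

```r
library(data.table)

# Build a small test table like the one in the question.
n <- 100
z <- data.table(c1 = sample(1:10000, n, replace = TRUE),
                c2 = sample(1:100000, n, replace = TRUE),
                c3 = sample(1:25, n, replace = TRUE),
                price = runif(n) * 10)

# Pivot directly: sum price within each c1/c2/c3 cell, one column per c3 value.
w <- dcast(z, c1 + c2 ~ c3, value.var = "price",
           fun.aggregate = sum, fill = NA)
```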