Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R performance with data reshaping

Tags:

performance

r

I am trying to reshape a data frame in R and it seems to have problems using the recommended ways of doing so. The data frame has the following structure:

ID                     DATE1             DATE2            VALTYPE        VALUE
'abcd1233'         2009-11-12        2009-12-23           'TYPE1'        123.45
...

VALTYPE is a string and is a factor with only 2 values (say TYPE1 and TYPE2). I need to transform it into the following data frame ("wide" transpose) based on common ID and DATEs:

ID                     DATE1             DATE2            VALUE.TYPE1  VALUE.TYPE2
'abcd1233'             2009-11-12        2009-12-23       123.45           NA
...

The data frame has more than 4,500,000 observations (although about 70% of VALUEs are NA). The machine is an Intel-based Linux workstation with 4Gb of RAM. Loading the data (from a compressed Rdata file) into a fresh R process makes it grow to about 250Mb which clearly leaves a lot of space for reshaping.

These are my experiences so far:

  • Using vanilla reshape() method:

    tbl2 <- reshape(tbl, direction = "wide", idvar = c("ID", "DATE1", "DATE2"), timevar = "VALTYPE");

RESULT: Error: cannot allocate vector of size 4.8 Gb

  • Using cast() method of reshape package:

    tbl2 <- cast(tbl, ID + DATE1 + DATE2 ~ VALTYPE);

RESULT: R process consumes all RAM with no end in sight. Had to kill the process eventually.

  • Using by() and merge():

    sp <- by(tbl[c(1,2,3,5)], tbl$VALTYPE, function(x) x); tbl <- merge(sp[["TYPE1"]], sp[["TYPE2"]], by = c("ID", "DATE1", "DATE2"), all = TRUE, sort = TRUE);

RESULT: works fine, although this is not very elegant and foolproof (i.e. it will break if more types are added).

To add insult to injury, the operation in question can be trivially achieved in about 3 lines of AWK or Perl (and with hardly any RAM used). So the question is: what is a better way to do this operation in R using recommended methods without consuming all available RAM?

like image 317
Alexander L. Belikoff Avatar asked Dec 07 '09 15:12

Alexander L. Belikoff


1 Answers

A useful trick is to combine the id variables into a character vector and then do the reshape.

tbl$NEWID <- with(tbl, paste(ID, DATE1, DATE2, sep=";"))
tbl2 <- recast(tbl2, NEWID ~ VALTYPE, measure.var="VALUE")

It's about 40% faster in a problem of similar size in my intel core2 duo 2.2ghz macbook.

like image 64
Eduardo Leoni Avatar answered Sep 28 '22 04:09

Eduardo Leoni