
R Reshape Performance

EDIT: When I created a simple sample data.frame, I gave every row the same date within each of the two Date columns. My real data does not look like that, which makes this problem harder.

Instead of this dataframe:

ID     Date           Balance    Date2        Balance2
1      01-01-2014     10000      01-02-2014   5000
2      01-01-2014     50000      01-02-2014   30000
3      01-01-2014     30000      01-02-2014   15000 
4      01-01-2014     5000       01-02-2014   3500

I have this dataframe instead:

ID     Date           Balance    Date2        Balance2
1      01-01-2014     10000      01-02-2017   5000
2      01-01-2015     50000      01-02-2016   30000
3      01-08-2014     30000      01-02-2015   15000 
4      01-02-2016     5000       01-02-2018   3500

Which I would like to reshape to the following:

ID     Date           Balance
1      01-01-2014     10000      
1      01-02-2017     5000
2      01-01-2015     50000      
2      01-02-2016     30000      
3      ...            ...        And so on...

I have the following at the moment.

Dates = a character vector containing the names of all the Date columns (Date, Date2, Date3...)
Balances = a character vector containing the names of all the Balance columns (Balance, Balance2...)

df <- reshape(df,
              varying = Balances,
              v.names = "Balance",
              timevar = "Date",
              times = Dates,
              direction = "long")

The methods you kindly proposed no longer give me the desired result now that I have changed my sample data.frame / data.table.

The main problem is that the date columns hold different dates for each ID, and I cannot change this. Date, Date2, Date3 and so on are always in chronological order, though.

I need a way to make R take the Date column and the Balance column and place them in a new DF, then take Date2 and Balance2 and rbind them with the first DF, then Date3 and Balance3, and so on, until I have worked through all of my 700ish variables.

I'm thinking of writing a loop, any thoughts? See below for sample data.

Thanks in advance,

Robert

df <- data.frame(ID       = 1:4,
                 Date     = c("01-01-2014", "01-01-2015", "01-08-2014", "01-02-2016"),
                 Balance  = c(10000, 50000, 30000, 5000),
                 Date2    = c("01-02-2017", "01-02-2016", "01-02-2015", "01-02-2018"),
                 Balance2 = c(5000, 30000, 15000, 3500))
Robert Luyt asked Feb 17 '15 13:02


2 Answers

If your columns are named as you've provided in your example, you can try merged.stack from my "splitstackshape" package. Note that the values in your "ID" column must be unique to work correctly though (as they are in your sample data).

Usage is straightforward: specify the "stubs" of the variables (here, "Date" and "Balance"). Setting sep = "var.stubs" just strips out the rest of the column name. The [, .time_1 := NULL] part just drops the time column that was created in the reshaping process.

library(splitstackshape)
merged.stack(mydf, var.stubs = c("Date", "Balance"), 
             sep = "var.stubs")[, .time_1 := NULL][]
#    ID       Date Balance
# 1:  1 01-01-2014   10000
# 2:  1 01-02-2014    5000
# 3:  2 01-01-2014   50000
# 4:  2 01-02-2014   30000
# 5:  3 01-01-2014   30000
# 6:  3 01-02-2014   15000
# 7:  4 01-01-2014    5000
# 8:  4 01-02-2014    3500

Soon (version 1.9.8 of "data.table"), melt will be able to handle conversion to a semi-long form like the one you're trying to get here. That will be faster than merged.stack presently is, but merged.stack should already be able to handle your present scenario.
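
For illustration, here is a minimal sketch of what that melt call could look like. It assumes a "data.table" release in which melt accepts patterns() for measure.vars, and it assumes the column names from the question's sample data (Date, Date2, Balance, Balance2). The object names dt and long are placeholders.

```r
library(data.table)

# Sample data from the question (assumed column names)
dt <- data.table(
  ID       = 1:4,
  Date     = c("01-01-2014", "01-01-2015", "01-08-2014", "01-02-2016"),
  Balance  = c(10000, 50000, 30000, 5000),
  Date2    = c("01-02-2017", "01-02-2016", "01-02-2015", "01-02-2018"),
  Balance2 = c(5000, 30000, 15000, 3500)
)

# Melt both groups of columns at once: each pattern becomes one value column
long <- melt(dt,
             id.vars      = "ID",
             measure.vars = patterns("^Date", "^Balance"),
             value.name   = c("Date", "Balance"))

# Keep each ID's rows together in Date, Date2, ... order, then drop the index column
setorder(long, ID, variable)
long[, variable := NULL]
print(long)
```

The patterns are matched against the column names in their existing order, so this relies on the Date and Balance columns appearing in matching order in the data.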

A5C1D2H2I1M1N2O1R2T1 answered Oct 06 '22 19:10


If you care about order, then the fastest method will probably come from the data.table answers. If you don't, you could simply stack the first three columns (ID, Date, Balance) on top of the first column plus the last two (ID, Date2, Balance2) using rbind. That will be very fast and simple, but the rows will not be in the order you want. You can reorder them afterwards with the order function on ID.
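
A minimal sketch of that rbind idea, assuming the sample data from the question; the names first, second and stacked are only illustrative. The second piece has to be renamed first, because rbind on data frames matches columns by name.

```r
# Sample data from the question
df <- data.frame(ID       = 1:4,
                 Date     = c("01-01-2014", "01-01-2015", "01-08-2014", "01-02-2016"),
                 Balance  = c(10000, 50000, 30000, 5000),
                 Date2    = c("01-02-2017", "01-02-2016", "01-02-2015", "01-02-2018"),
                 Balance2 = c(5000, 30000, 15000, 3500))

first  <- df[, c("ID", "Date", "Balance")]
second <- df[, c("ID", "Date2", "Balance2")]
names(second) <- names(first)   # give both pieces identical column names

# Stack the two pieces, then bring each ID's rows together;
# order() keeps ties in their original order, so Date comes before Date2
stacked <- rbind(first, second)
stacked <- stacked[order(stacked$ID), ]
rownames(stacked) <- NULL
print(stacked)
```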

Alternatively, you could generate two matrices, transpose them, and then bind everything together as vectors. This will be pretty fast because you're only making a few copies and selections, and the reordering comes from reading the transposed data in a different order rather than from a sorting algorithm.

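# Pull the date and balance columns out as matrices (one row per ID)
dateMat <- as.matrix(df[, c(2, 4)])
balMat  <- as.matrix(df[, c(3, 5)])
# Transposing and then flattening reads the values row by row,
# so each ID's entries end up next to each other
dates    <- as.vector(t(dateMat))
balances <- as.vector(t(balMat))
dfl <- data.frame(ID = rep(df$ID, each = 2), Date = dates, Balance = balances)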

You can test the two versions out for speed on your large data.frame.
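
As a rough sketch of such a comparison, both approaches can be wrapped in functions that take the column names, so they extend to many Date/Balance pairs. The helper names reshape_matrix and reshape_rbind are hypothetical, and Dates and Balances are built here with grep on the assumption that the columns follow the Date*/Balance* naming from the question and appear in matching order.

```r
# Sample data from the question
df <- data.frame(ID       = 1:4,
                 Date     = c("01-01-2014", "01-01-2015", "01-08-2014", "01-02-2016"),
                 Balance  = c(10000, 50000, 30000, 5000),
                 Date2    = c("01-02-2017", "01-02-2016", "01-02-2015", "01-02-2018"),
                 Balance2 = c(5000, 30000, 15000, 3500))

# Column names of each group (assumes the Date*/Balance* naming scheme)
Dates    <- grep("^Date", names(df), value = TRUE)
Balances <- grep("^Balance", names(df), value = TRUE)

# Matrix approach: transpose and flatten so each ID's values stay together
reshape_matrix <- function(d, date_cols, bal_cols) {
  data.frame(
    ID      = rep(d$ID, each = length(date_cols)),
    Date    = as.vector(t(as.matrix(d[, date_cols, drop = FALSE]))),
    Balance = as.vector(t(as.matrix(d[, bal_cols, drop = FALSE])))
  )
}

# rbind approach: one small data frame per Date/Balance pair, stacked and reordered
reshape_rbind <- function(d, date_cols, bal_cols) {
  pieces <- Map(function(dc, bc) {
    data.frame(ID = d$ID, Date = d[[dc]], Balance = d[[bc]])
  }, date_cols, bal_cols)
  out <- do.call(rbind, unname(pieces))
  out <- out[order(out$ID), ]
  rownames(out) <- NULL
  out
}

# Timings on four rows are meaningless; run this on the real data instead
system.time(a <- reshape_matrix(df, Dates, Balances))
system.time(b <- reshape_rbind(df, Dates, Balances))
```

On the sample data both functions return the same eight rows, so any timing difference on the full data reflects the method rather than the result.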

John answered Oct 06 '22 19:10