
rearrange data.frame to get the sequential order of products

Tags: date, r

I have a dataframe of the following form:

df <- data.frame(client = c("client1", "client1", "client2", "client3", "client3"),
                 product = c("A", "B", "A", "D", "A"),
                 purchase_Date = c("2010-03-22", "2010-02-02", "2009-03-02", "2011-04-05", "2012-11-01"))
df$purchase_Date <- as.Date(df$purchase_Date, format = "%Y-%m-%d")

which looks like this:

   client product purchase_Date
1 client1       A    2010-03-22
2 client1       B    2010-02-02
3 client2       A    2009-03-02
4 client3       D    2011-04-05
5 client3       A    2012-11-01

which I would like to rearrange like this:

   client purchase1 purchase2
1 client1         B         A
2 client2         A      <NA>
3 client3         D         A

So I would like to find out which product each client bought first, second, third and so on, ordered by purchase_Date. I can easily get each position individually using data.table:

library(data.table)
setDT(df)[, .SD[order(purchase_Date), product][1], by = client]

This gives the first one, but I have no idea how to efficiently get the desired output.

grrgrrbla asked Jun 23 '15 15:06


2 Answers

Here's a possible data.table solution. (If any client has 10 or more purchases, I'd recommend dropping paste0 and just using indx := seq_len(.N) instead: the pasted names are character strings and sort alphabetically, so purchase10 would land before purchase2 and mess up the column order.)

setDT(df)[order(purchase_Date), indx := paste0("purchase", seq_len(.N)), by = client]
dcast(df, client ~ indx, value.var = "product")
#     client purchase1 purchase2
# 1: client1         B         A
# 2: client2         A        NA
# 3: client3         D         A
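The seq_len(.N) variant mentioned above could look roughly like this. It is only a sketch on made-up data (a single hypothetical client with 12 purchases, not taken from the question), and renaming the columns with setnames() after the reshape is an assumption about how one might get the purchaseN names back without losing the numeric order.

```r
library(data.table)

# Hypothetical data: one client with 12 purchases on consecutive days
dt <- data.table(
  client = "client1",
  product = LETTERS[1:12],
  purchase_Date = as.Date("2010-01-01") + 0:11
)

# Integer index per client, assigned in purchase-date order
dt[order(purchase_Date), indx := seq_len(.N), by = client]

# With an integer indx, dcast sorts the new columns numerically (1, 2, ..., 12)
wide <- dcast(dt, client ~ indx, value.var = "product")

# Add the "purchase" prefix only after reshaping, so the column order is kept
setnames(wide, names(wide)[-1], paste0("purchase", names(wide)[-1]))
names(wide)
```

Building the prefix into indx before dcast would instead sort the columns as purchase1, purchase10, purchase11, purchase12, purchase2, and so on.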

Comparison between frank() and order() approaches to create indx col:

require(data.table)
set.seed(45L)
dt = data.table(client = sample(paste("client", 1:1e4, sep=""), 1e6, TRUE))
dt[, `:=`(product = sample(paste("p", 1:200, sep=""), .N, FALSE), 
          purchase_Date = as.Date(sample(14610:16586, .N, FALSE), 
           origin = "1970-01-01")), by=client]

system.time(dt[order(purchase_Date), indx := seq_len(.N), by = client])
# user  system elapsed 
# 0.19    0.02    0.20 
system.time(dt[, purch_rank := frank(purchase_Date, ties.method = "dense"), by=client])
# user  system elapsed 
# 3.94    0.00    3.98 
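Speed aside, the two index columns are only interchangeable when a client never has two purchases on the same date. A small sketch on made-up data (not from the question) shows the difference: seq_len(.N) always yields distinct positions, while a dense rank gives tied dates the same value.

```r
library(data.table)

# Hypothetical data: products A and B were bought on the same day
dt <- data.table(
  client = "client1",
  product = c("A", "B", "C"),
  purchase_Date = as.Date(c("2010-01-01", "2010-01-01", "2010-02-01"))
)

# Position in date order: ties are broken by row order
dt[order(purchase_Date), indx := seq_len(.N), by = client]

# Dense rank: tied dates share a rank
dt[, purch_rank := frank(purchase_Date, ties.method = "dense"), by = client]

dt[, .(product, indx, purch_rank)]
# indx is 1, 2, 3; purch_rank is 1, 1, 2
```

With shared ranks, client ~ purch_rank no longer identifies a single product per cell, so the reshape step would need an aggregation function or a tie-breaking rule.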
David Arenburg answered Nov 15 '22 10:11


A dplyr/tidyr approach:

library(dplyr)
library(tidyr)

df %>%
  group_by(client) %>%
  mutate(purch_rank = dense_rank(purchase_Date)) %>%
  select(-purchase_Date) %>%
  spread(purch_rank, product)
#Source: local data frame [3 x 3]
#
#   client 1  2
#1 client1 B  A
#2 client2 A NA
#3 client3 D  A
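In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A minimal sketch of the same idea with the newer function follows; it rebuilds the question's data so it runs on its own, and swapping dense_rank() for arrange() plus row_number() is my own substitution (an assumption, chosen so that same-day purchases would still get distinct positions).

```r
library(dplyr)
library(tidyr)

# The question's data, rebuilt so the example is self-contained
df <- data.frame(
  client = c("client1", "client1", "client2", "client3", "client3"),
  product = c("A", "B", "A", "D", "A"),
  purchase_Date = as.Date(c("2010-03-22", "2010-02-02", "2009-03-02",
                            "2011-04-05", "2012-11-01")),
  stringsAsFactors = FALSE
)

wide <- df %>%
  arrange(client, purchase_Date) %>%      # sort purchases within each client
  group_by(client) %>%
  mutate(purch_rank = row_number()) %>%   # 1, 2, ... in date order
  ungroup() %>%
  select(-purchase_Date) %>%
  pivot_wider(names_from = purch_rank, values_from = product,
              names_prefix = "purchase")  # columns purchase1, purchase2, ...

wide
```

The names_prefix argument gives the purchase1, purchase2 column names asked for in the question; because the prefix is applied while pivoting on a numeric column, the columns stay in numeric order.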

And a possible data.table approach:

library(data.table) # frank() needs v1.9.5+ (the GitHub development version at the time of writing)
setDT(df)[, purch_rank := frank(purchase_Date, ties.method = "dense"), by=client]
dcast(df, client ~ purch_rank, value.var = "product")
#    client 1  2
#1: client1 B  A
#2: client2 A NA
#3: client3 D  A
talat answered Nov 15 '22 11:11