I have a dataframe of the following form:
df <- data.frame(client = c("client1", "client1", "client2", "client3", "client3"),
product = c("A", "B", "A", "D", "A"),
purchase_Date = c("2010-03-22", "2010-02-02", "2009-03-02", "2011-04-05", "2012-11-01"))
df$purchase_Date <- as.Date(df$purchase_Date, format = "%Y-%m-%d")
which looks like this:
client product purchase_Date
1 client1 A 2010-03-22
2 client1 B 2010-02-02
3 client2 A 2009-03-02
4 client3 D 2011-04-05
5 client3 A 2012-11-01
which I would like to rearrange like this:
client purchase1 purchase2
1 client1 B A
2 client2 A <NA>
3 client3 D A
So I would like to find out which product each client bought first, second, third and so on, ordered by purchase_Date. I can easily get each one individually using data.table:
library(data.table)
setDT(df)[, .SD[order(purchase_Date), product][1], by = client]
That gives the first one, but I have no idea how to efficiently get the desired output.
Here's a possible data.table solution. (If a client has 10 or more purchases, I'd recommend avoiding paste0 and just using indx := seq_len(.N) instead, because names such as purchase10 sort before purchase2 and could mess up the column order.)
setDT(df)[order(purchase_Date), indx := paste0("purchase", seq_len(.N)), by = client]
dcast(df, client ~ indx, value.var = "product")
# client purchase1 purchase2
# 1: client1 B A
# 2: client2 A NA
# 3: client3 D A
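For clients with 10 or more purchases, the numeric-index variant mentioned above might look like the following sketch. The toy data (a single client with 11 purchases, products p1 to p11) is made up for illustration, and adding the prefix with setnames() after reshaping is just one possible way to get the purchaseN names back.

```r
library(data.table)

# Hypothetical toy data: one client with 11 purchases on consecutive days
dt <- data.table(
  client = "client1",
  product = paste0("p", 1:11),
  purchase_Date = as.Date("2010-01-01") + 0:10
)

# Numeric purchase index per client, assigned in date order
dt[order(purchase_Date), indx := seq_len(.N), by = client]

# dcast orders the new columns by the integer value of indx, so 2 precedes 10
wide <- dcast(dt, client ~ indx, value.var = "product")

# Add the "purchase" prefix only after reshaping, so the column order is kept
old <- names(wide)[-1]
setnames(wide, old, paste0("purchase", old))
names(wide)
```

Because the index stays an integer until after dcast, the columns come out as purchase1, purchase2, ..., purchase11 rather than in alphabetical order.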
Comparison between the frank() and order() approaches to create the indx column:
library(data.table)
set.seed(45L)
dt = data.table(client = sample(paste("client", 1:1e4, sep=""), 1e6, TRUE))
dt[, `:=`(product = sample(paste("p", 1:200, sep=""), .N, FALSE),
purchase_Date = as.Date(sample(14610:16586, .N, FALSE),
origin = "1970-01-01")), by=client]
system.time(dt[order(purchase_Date), indx := seq_len(.N), by = client])
# user system elapsed
# 0.19 0.02 0.20
system.time(dt[, purch_rank := frank(purchase_Date, ties.method = "dense"), by=client])
# user system elapsed
# 3.94 0.00 3.98
A dplyr/tidyr approach:
library(dplyr)
library(tidyr)
df %>%
group_by(client) %>%
mutate(purch_rank = dense_rank(purchase_Date)) %>%
select(-purchase_Date) %>%
spread(purch_rank, product)
#Source: local data frame [3 x 3]
#
# client 1 2
#1 client1 B A
#2 client2 A NA
#3 client3 D A
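In newer versions of tidyr, spread() is superseded by pivot_wider(). The sketch below assumes tidyr 1.1.0 or later (for the names_sort argument) and rebuilds the question's df so it runs on its own. names_prefix is used here to get the purchase1, purchase2 column names asked for in the question.

```r
library(dplyr)
library(tidyr)

# The example data from the question
df <- data.frame(
  client = c("client1", "client1", "client2", "client3", "client3"),
  product = c("A", "B", "A", "D", "A"),
  purchase_Date = as.Date(c("2010-03-22", "2010-02-02", "2009-03-02",
                            "2011-04-05", "2012-11-01")),
  stringsAsFactors = FALSE
)

wide <- df %>%
  group_by(client) %>%
  mutate(purch_rank = dense_rank(purchase_Date)) %>%  # rank purchases per client
  ungroup() %>%
  select(-purchase_Date) %>%
  pivot_wider(names_from = purch_rank, values_from = product,
              names_prefix = "purchase",  # columns become purchase1, purchase2, ...
              names_sort = TRUE)          # order columns by rank, not first appearance
wide
```

Note that dense_rank() gives tied purchase dates the same rank, so a client with two purchases on the same day would end up with two values in one cell. row_number() after arrange(purchase_Date) is a possible alternative if that matters.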
And a possible data.table approach:
library(data.table) #v 1.9.5+ currently from GitHub for "frank"
setDT(df)[, purch_rank := frank(purchase_Date, ties.method = "dense"), by=client]
dcast(df, client ~ purch_rank, value.var = "product")
# client 1 2
#1: client1 B A
#2: client2 A NA
#3: client3 D A