This is what my dataframe looks like. The two rightmost columns are my desired columns. I am counting the cumulative number of unique FundTypes as of each row.The 4th columns is the cumulative unique count for all "ActivityType" and the 5th column is the cumulative unique count for only "ActivityType=="Sale".
dt <- read.table(text='
Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
John Email a 1 0
John Sale a;b 2 2
John Webinar c;d 4 2
John Sale b 4 2
John Webinar e 5 2
John Conference b;d 5 2
John Sale b;e 5 3
Tom Email a 1 0
Tom Sale a;b 2 2
Tom Webinar c;d 4 2
Tom Sale b 4 2
Tom Webinar e 5 2
Tom Conference b;d 5 2
Tom Sale b;e;f 6 4
', header=T, row.names = NULL)
I have tried dt[, UniqueFunds := cumsum(!duplicated(FundType)& !FundType=="") ,by = Name]
but for example it counts a & a;b & c;d as 3 unique values as opposed to the desired 4 unique values as the factors are separated by semicolon.Kindly let me know of a solution.
UPDATE: My real dataset looks more like this:
dt <- read.table(text='
Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
John Email "" 0 0
John Conference "" 0 0
John Email a 1 0
John Sale a;b 2 2
John Webinar c;d 4 2
John Sale b 4 2
John Webinar e 5 2
John Conference b;d 5 2
John Sale b;e 5 3
John Email "" 5 3
John Webinar "" 5 3
Tom Email a 1 0
Tom Sale a;b 2 2
Tom Webinar c;d 4 2
Tom Sale b 4 2
Tom Webinar e 5 2
Tom Conference b;d 5 2
Tom Sale b;e;f 6 4
', header=T, row.names = NULL)
The unique cumulative vectors need to take into account the missing values.
nrussell suggested a concise solution writing a custom function. Let me drop what I got. I tried to used cumsum()
and duplicated()
as you tried. I did two major operations. One for alltype
and the other for saleonly
. First, I created indices for each name. Then, I split FundType
and formatted the data in a long format with cSplit()
from the splitstackshape package. Then, I chose the last row for each index number for each Name. Finally, I chose only one column, alltype
.
library(splitstackshape)
library(zoo)
library(data.table)
setDT(dt)[, ind := 1:.N, by = "Name"]
cSplit(dt, "FundType", sep = ";", direction = "long")[,
alltype := cumsum(!duplicated(FundType)), by = "Name"][,
.SD[.N], by = c("Name", "ind")][, list(alltype)] -> alltype
The second operation was for saleonly. Basically, I repeated the same approach to a subsetted data for sale, which is ana
. I also created a data set without sale, which is ana2
. Then, I created a list with the two data sets (i.e., l
) and bound them. I changed the order of the dataset with Name
and ind
, take the last rows for each name and index number, taking care of NAs (filling NAs and replacing the first NA for each Name with 0), and finally chose one column. The final operation was to combine the original dt
, alltype
, and saleonly
.
# data for sale only
cSplit(dt, "FundType", sep = ";", direction = "long")[
ActivityType == "Sale"][,
saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana
# Data without sale
cSplit(dt, "FundType", sep = ";", direction = "long")[
ActivityType != "Sale"] -> ana2
# Combine ana and ana2
l <- list(ana, ana2)
rbindlist(l, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,
.SD[.N], by = c("Name", "ind")][,
saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][,
saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly
cbind(dt, alltype, saleonly)
Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly
1: John Email a 1 0 1 1 0
2: John Sale a;b 2 2 2 2 2
3: John Webinar c;d 4 2 3 4 2
4: John Sale b 4 2 4 4 2
5: John Webinar e 5 2 5 5 2
6: John Conference b;d 5 2 6 5 2
7: John Sale b;e 5 3 7 5 3
8: Tom Email a 1 0 1 1 0
9: Tom Sale a;b 2 2 2 2 2
10: Tom Webinar c;d 4 2 3 4 2
11: Tom Sale b 4 2 4 4 2
12: Tom Webinar e 5 2 5 5 2
13: Tom Conference b;d 5 2 6 5 2
14: Tom Sale b;e;f 6 4 7 6 4
EDIT
For the new data set, I tried the following. Basically, I used my approach for the saleonly data to this new data set. The revision was only in the alltype
part. First, I added indices, replaced "" with NA, and subsetted the data with rows having non-NA values. This is temp
. The rest is identical to the previous answer. Now I wanted to have the data set with NAs in FundType, so I used setdiff()
. Using rbindlist()
, I combined the two data sets and created temp
. The rest is identical to the previous answer. The sale-part does not have any changes. I hope this will work for your real data.
### all type
setDT(dt)[, ind := 1:.N, by = "Name"][,
FundType := replace(FundType, which(FundType == ""), NA)][FundType != ""] -> temp
cSplit(temp, "FundType", sep = ";", direction = "long")[,
alltype := cumsum(!duplicated(FundType)), by = "Name"] -> alltype
whatever <- list(setdiff(dt, temp), alltype)
rbindlist(whatever, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,.SD[.N], by = c("Name", "ind")][,
alltype := na.locf(alltype, na.rm = FALSE), by = "Name"][,
alltype := replace(alltype, is.na(alltype), 0)][, list(alltype)] -> alltype
### sale only
cSplit(dt, "FundType", sep = ";", direction = "long")[
ActivityType == "Sale"][,
saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana
cSplit(dt, "FundType", sep = ";", direction = "long")[
ActivityType != "Sale"] -> ana2
l <- list(ana, ana2)
rbindlist(l, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,
.SD[.N], by = c("Name", "ind")][,
saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][,
saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly
cbind(dt, alltype, saleonly)
Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly
1: John Email NA 0 0 1 0 0
2: John Conference NA 0 0 2 0 0
3: John Email a 1 0 3 1 0
4: John Sale a;b 2 2 4 2 2
5: John Webinar c;d 4 2 5 4 2
6: John Sale b 4 2 6 4 2
7: John Webinar e 5 2 7 5 2
8: John Conference b;d 5 2 8 5 2
9: John Sale b;e 5 3 9 5 3
10: John Email NA 5 3 10 5 3
11: John Webinar NA 5 3 11 5 3
12: Tom Email a 1 0 1 1 0
13: Tom Sale a;b 2 2 2 2 2
14: Tom Webinar c;d 4 2 3 4 2
15: Tom Sale b 4 2 4 4 2
16: Tom Webinar e 5 2 5 5 2
17: Tom Conference b;d 5 2 6 5 2
18: Tom Sale b;e;f 6 4 7 6 4
I think this is one way to achieve what you are after. Start by adding an auxiliary index variable for maintaining the input order; and key
ing on Name
:
Dt <- copy(dt[, 1:3, with = FALSE])[, gIdx := 1:.N, by = "Name"]
setkeyv(Dt, "Name")
For clarity, I used this function
n_usplit <- function(x, spl = ";") length(unique(unlist(strsplit(x, split = spl))))
rather than typing out the body's expression on the fly - the operation below is dense enough as it is without a bunch of nested function calls convoluting things.
And finally,
Dt[Dt, allow.cartesian = TRUE][
gIdx <= i.gIdx,
.("UniqueFunds(AllTypes)" = n_usplit(FundType),
"UniqueFunds(SaleOnly)" = n_usplit(FundType[ActivityType == "Sale"])),
keyby = "Name,i.gIdx,i.ActivityType,i.FundType"][,-2, with = FALSE]
# Name i.ActivityType i.FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
# 1: John Email a 1 0
# 2: John Sale a;b 2 2
# 3: John Webinar c;d 4 2
# 4: John Sale b 4 2
# 5: John Webinar e 5 2
# 6: John Conference b;d 5 2
# 7: John Sale b;e 5 3
# 8: Tom Email a 1 0
# 9: Tom Sale a;b 2 2
# 10: Tom Webinar c;d 4 2
# 11: Tom Sale b 4 2
# 12: Tom Webinar e 5 2
# 13: Tom Conference b;d 5 2
# 14: Tom Sale b;e;f 6 4
I feel like I could explain this easier with SQL, but here we go:
Dt
on itself (by Name
) gIdx
), only consider previous (inclusive) rows in sequence - this results in a sort of cumulative effect (for lack of better term)UniqueFunds(...)
columns - notice the extra subsetting done in the second case - n_usplit(FundType[ActivityType == "Sale"])
i.gIdx
). I'm not sure how this will scale due to the use of a cartesian join, so hopefully your real data set isn't millions of rows.
Data:
library(data.table)
##
dt <- fread('
Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
John Email a 1 0
John Sale a;b 2 2
John Webinar c;d 4 2
John Sale b 4 2
John Webinar e 5 2
John Conference b;d 5 2
John Sale b;e 5 3
Tom Email a 1 0
Tom Sale a;b 2 2
Tom Webinar c;d 4 2
Tom Sale b 4 2
Tom Webinar e 5 2
Tom Conference b;d 5 2
Tom Sale b;e;f 6 4
', header = TRUE)
I achieved what you're looking for as follows:
library(data.table)
library(stringr)
dt <- data.table(read.table(text='
Name ActivityType FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
John Email a 1 0
John Sale a;b 2 2
John Webinar c;d 4 2
John Sale b 4 2
John Webinar e 5 2
John Conference b;d 5 2
John Sale b;e 5 3
Tom Email a 1 0
Tom Sale a;b 2 2
Tom Webinar c;d 4 2
Tom Sale b 4 2
Tom Webinar e 5 2
Tom Conference b;d 5 2
Tom Sale b;e;f 6 4
', header=T, row.names = NULL))
dt[,UniqueFunds.AllTypes. := NULL][,UniqueFunds.SaleOnly. := NULL]
#Get the different Fund Types
vals <- unique(unlist(str_extract_all(dt$FundType,"[a-z]")))
#Construct a new set of columns indicating which fund types are present
dt[,vals:=data.table(1*t(sapply(FundType,str_detect,vals))),with=FALSE]
#Calculate UniqueFunds.AllTypes
dt[, UniqueFunds.AllTypes. :=
rowSums(sapply(.SD, cummax)), .SDcols = vals, by = Name]
#Calculate only when ActicityType == "Sale" and use cummax to achieve desired output
dt[,UniqueFunds.SaleOnly. := 0
][ActivityType == "Sale", UniqueFunds.SaleOnly. :=
rowSums(sapply(.SD, cummax)), .SDcols = vals, by = Name
][,UniqueFunds.SaleOnly. := cummax(UniqueFunds.SaleOnly.), by = Name
]
#Cleanup vals
dt[,vals := NULL, with = FALSE]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With