Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count cumulative unique factors separated by semicolon Grouped by Name

This is what my dataframe looks like. The two rightmost columns are my desired columns. I am counting the cumulative number of unique FundTypes as of each row.The 4th columns is the cumulative unique count for all "ActivityType" and the 5th column is the cumulative unique count for only "ActivityType=="Sale".

dt <- read.table(text='

Name      ActivityType     FundType  UniqueFunds(AllTypes) UniqueFunds(SaleOnly)         

John       Email               a            1                     0
John       Sale                a;b          2                     2 
John       Webinar             c;d          4                     2
John       Sale                b            4                     2
John       Webinar             e            5                     2
John       Conference          b;d          5                     2
John       Sale                b;e          5                     3
Tom        Email               a            1                     0
Tom        Sale                a;b          2                     2 
Tom        Webinar             c;d          4                     2
Tom        Sale                b            4                     2
Tom        Webinar             e            5                     2
Tom        Conference          b;d          5                     2
Tom        Sale                b;e;f        6                     4                    

                         ', header=T, row.names = NULL)

I have tried dt[, UniqueFunds := cumsum(!duplicated(FundType)& !FundType=="") ,by = Name] but for example it counts a & a;b & c;d as 3 unique values as opposed to the desired 4 unique values as the factors are separated by semicolon.Kindly let me know of a solution.

UPDATE: My real dataset looks more like this:

dt <- read.table(text='

    Name      ActivityType     FundType  UniqueFunds(AllTypes) UniqueFunds(SaleOnly)         
    John       Email               ""           0                     0
    John       Conference          ""           0                     0
    John       Email               a            1                     0
    John       Sale                a;b          2                     2 
    John       Webinar             c;d          4                     2
    John       Sale                b            4                     2
    John       Webinar             e            5                     2
    John       Conference          b;d          5                     2
    John       Sale                b;e          5                     3
    John       Email               ""           5                     3
    John       Webinar             ""           5                     3
    Tom        Email               a            1                     0
    Tom        Sale                a;b          2                     2 
    Tom        Webinar             c;d          4                     2
    Tom        Sale                b            4                     2
    Tom        Webinar             e            5                     2
    Tom        Conference          b;d          5                     2
    Tom        Sale                b;e;f        6                     4                    

                             ', header=T, row.names = NULL)

The unique cumulative vectors need to take into account the missing values.

like image 237
gibbz00 Avatar asked Dec 28 '15 23:12

gibbz00


3 Answers

nrussell suggested a concise solution writing a custom function. Let me drop what I got. I tried to used cumsum() and duplicated() as you tried. I did two major operations. One for alltype and the other for saleonly. First, I created indices for each name. Then, I split FundType and formatted the data in a long format with cSplit() from the splitstackshape package. Then, I chose the last row for each index number for each Name. Finally, I chose only one column, alltype.

library(splitstackshape)
library(zoo)
library(data.table)

setDT(dt)[, ind := 1:.N, by = "Name"]
cSplit(dt, "FundType", sep = ";", direction = "long")[,
    alltype := cumsum(!duplicated(FundType)), by = "Name"][,
    .SD[.N], by = c("Name", "ind")][, list(alltype)] -> alltype

The second operation was for saleonly. Basically, I repeated the same approach to a subsetted data for sale, which is ana. I also created a data set without sale, which is ana2. Then, I created a list with the two data sets (i.e., l) and bound them. I changed the order of the dataset with Name and ind, take the last rows for each name and index number, taking care of NAs (filling NAs and replacing the first NA for each Name with 0), and finally chose one column. The final operation was to combine the original dt, alltype, and saleonly.

# data for sale only
cSplit(dt, "FundType", sep = ";", direction = "long")[
    ActivityType == "Sale"][,
    saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana

# Data without sale
cSplit(dt, "FundType", sep = ";", direction = "long")[
    ActivityType != "Sale"] -> ana2 

# Combine ana and ana2
l <- list(ana, ana2)
rbindlist(l, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,
    .SD[.N], by = c("Name", "ind")][,
    saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][,
    saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly

cbind(dt, alltype, saleonly)

    Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly
 1: John        Email        a                     1                     0   1       1        0
 2: John         Sale      a;b                     2                     2   2       2        2
 3: John      Webinar      c;d                     4                     2   3       4        2
 4: John         Sale        b                     4                     2   4       4        2
 5: John      Webinar        e                     5                     2   5       5        2
 6: John   Conference      b;d                     5                     2   6       5        2
 7: John         Sale      b;e                     5                     3   7       5        3
 8:  Tom        Email        a                     1                     0   1       1        0
 9:  Tom         Sale      a;b                     2                     2   2       2        2
10:  Tom      Webinar      c;d                     4                     2   3       4        2
11:  Tom         Sale        b                     4                     2   4       4        2
12:  Tom      Webinar        e                     5                     2   5       5        2
13:  Tom   Conference      b;d                     5                     2   6       5        2
14:  Tom         Sale    b;e;f                     6                     4   7       6        4

EDIT

For the new data set, I tried the following. Basically, I used my approach for the saleonly data to this new data set. The revision was only in the alltype part. First, I added indices, replaced "" with NA, and subsetted the data with rows having non-NA values. This is temp. The rest is identical to the previous answer. Now I wanted to have the data set with NAs in FundType, so I used setdiff(). Using rbindlist(), I combined the two data sets and created temp. The rest is identical to the previous answer. The sale-part does not have any changes. I hope this will work for your real data.

### all type

setDT(dt)[, ind := 1:.N, by = "Name"][,
    FundType := replace(FundType, which(FundType == ""), NA)][FundType != ""] -> temp
cSplit(temp, "FundType", sep = ";", direction = "long")[,
    alltype := cumsum(!duplicated(FundType)), by = "Name"] -> alltype


whatever <- list(setdiff(dt, temp), alltype)
rbindlist(whatever, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,.SD[.N], by = c("Name", "ind")][,
    alltype := na.locf(alltype, na.rm = FALSE), by = "Name"][,
    alltype := replace(alltype, is.na(alltype), 0)][, list(alltype)] -> alltype


### sale only
cSplit(dt, "FundType", sep = ";", direction = "long")[
    ActivityType == "Sale"][,
    saleonly := cumsum(!duplicated(FundType)), by = "Name"] -> ana

cSplit(dt, "FundType", sep = ";", direction = "long")[
    ActivityType != "Sale"] -> ana2

l <- list(ana, ana2)
rbindlist(l, use.names = TRUE, fill = TRUE) -> temp
setorder(temp, Name, ind)[,
    .SD[.N], by = c("Name", "ind")][,
    saleonly := na.locf(saleonly, na.rm = FALSE), by = "Name"][,
    saleonly := replace(saleonly, is.na(saleonly), 0)][, list(saleonly)] -> saleonly

cbind(dt, alltype, saleonly)


    Name ActivityType FundType UniqueFunds.AllTypes. UniqueFunds.SaleOnly. ind alltype saleonly
 1: John        Email       NA                     0                     0   1       0        0
 2: John   Conference       NA                     0                     0   2       0        0
 3: John        Email        a                     1                     0   3       1        0
 4: John         Sale      a;b                     2                     2   4       2        2
 5: John      Webinar      c;d                     4                     2   5       4        2
 6: John         Sale        b                     4                     2   6       4        2
 7: John      Webinar        e                     5                     2   7       5        2
 8: John   Conference      b;d                     5                     2   8       5        2
 9: John         Sale      b;e                     5                     3   9       5        3
10: John        Email       NA                     5                     3  10       5        3
11: John      Webinar       NA                     5                     3  11       5        3
12:  Tom        Email        a                     1                     0   1       1        0
13:  Tom         Sale      a;b                     2                     2   2       2        2
14:  Tom      Webinar      c;d                     4                     2   3       4        2
15:  Tom         Sale        b                     4                     2   4       4        2
16:  Tom      Webinar        e                     5                     2   5       5        2
17:  Tom   Conference      b;d                     5                     2   6       5        2
18:  Tom         Sale    b;e;f                     6                     4   7       6        4
like image 199
jazzurro Avatar answered Nov 03 '22 16:11

jazzurro


I think this is one way to achieve what you are after. Start by adding an auxiliary index variable for maintaining the input order; and keying on Name:

Dt <- copy(dt[, 1:3, with = FALSE])[, gIdx := 1:.N, by = "Name"]
setkeyv(Dt, "Name") 

For clarity, I used this function

n_usplit <- function(x, spl = ";") length(unique(unlist(strsplit(x, split = spl)))) 

rather than typing out the body's expression on the fly - the operation below is dense enough as it is without a bunch of nested function calls convoluting things.

And finally,

Dt[Dt, allow.cartesian = TRUE][
  gIdx <= i.gIdx, 
  .("UniqueFunds(AllTypes)" = n_usplit(FundType),
    "UniqueFunds(SaleOnly)" = n_usplit(FundType[ActivityType == "Sale"])),
  keyby = "Name,i.gIdx,i.ActivityType,i.FundType"][,-2, with = FALSE]
#      Name i.ActivityType i.FundType UniqueFunds(AllTypes) UniqueFunds(SaleOnly)
# 1:   John          Email          a                     1                     0
# 2:   John           Sale        a;b                     2                     2
# 3:   John        Webinar        c;d                     4                     2
# 4:   John           Sale          b                     4                     2
# 5:   John        Webinar          e                     5                     2
# 6:   John     Conference        b;d                     5                     2
# 7:   John           Sale        b;e                     5                     3
# 8:    Tom          Email          a                     1                     0
# 9:    Tom           Sale        a;b                     2                     2
# 10:   Tom        Webinar        c;d                     4                     2
# 11:   Tom           Sale          b                     4                     2
# 12:   Tom        Webinar          e                     5                     2
# 13:   Tom     Conference        b;d                     5                     2
# 14:   Tom           Sale      b;e;f                     6                     4

I feel like I could explain this easier with SQL, but here we go:

  1. Join Dt on itself (by Name)
  2. Using the extra index column (gIdx), only consider previous (inclusive) rows in sequence - this results in a sort of cumulative effect (for lack of better term)
  3. Calculate the UniqueFunds(...) columns - notice the extra subsetting done in the second case - n_usplit(FundType[ActivityType == "Sale"])
  4. Drop the extraneous index column (i.gIdx).

I'm not sure how this will scale due to the use of a cartesian join, so hopefully your real data set isn't millions of rows.


Data:

library(data.table)
##
dt <- fread('
Name      ActivityType     FundType  UniqueFunds(AllTypes) UniqueFunds(SaleOnly)         
John       Email               a            1                     0
John       Sale                a;b          2                     2 
John       Webinar             c;d          4                     2
John       Sale                b            4                     2
John       Webinar             e            5                     2
John       Conference          b;d          5                     2
John       Sale                b;e          5                     3
Tom        Email               a            1                     0
Tom        Sale                a;b          2                     2 
Tom        Webinar             c;d          4                     2
Tom        Sale                b            4                     2
Tom        Webinar             e            5                     2
Tom        Conference          b;d          5                     2
Tom        Sale                b;e;f        6                     4                     
            ', header = TRUE)
like image 6
nrussell Avatar answered Nov 03 '22 15:11

nrussell


I achieved what you're looking for as follows:

library(data.table)
library(stringr)
dt <- data.table(read.table(text='

                 Name      ActivityType     FundType  UniqueFunds(AllTypes) UniqueFunds(SaleOnly)         
                 John       Email               a            1                     0
                 John       Sale                a;b          2                     2 
                 John       Webinar             c;d          4                     2
                 John       Sale                b            4                     2
                 John       Webinar             e            5                     2
                 John       Conference          b;d          5                     2
                 John       Sale                b;e          5                     3
                 Tom        Email               a            1                     0
                 Tom        Sale                a;b          2                     2 
                 Tom        Webinar             c;d          4                     2
                 Tom        Sale                b            4                     2
                 Tom        Webinar             e            5                     2
                 Tom        Conference          b;d          5                     2
                 Tom        Sale                b;e;f        6                     4                    

                 ', header=T, row.names = NULL))

dt[,UniqueFunds.AllTypes. := NULL][,UniqueFunds.SaleOnly. := NULL]

#Get the different Fund Types
vals <- unique(unlist(str_extract_all(dt$FundType,"[a-z]")))

#Construct a new set of columns indicating which fund types are present
dt[,vals:=data.table(1*t(sapply(FundType,str_detect,vals))),with=FALSE]

#Calculate UniqueFunds.AllTypes
dt[, UniqueFunds.AllTypes. := 
     rowSums(sapply(.SD, cummax)), .SDcols = vals, by = Name]

#Calculate only when ActicityType == "Sale" and use cummax to achieve desired output
dt[,UniqueFunds.SaleOnly. := 0
   ][ActivityType == "Sale", UniqueFunds.SaleOnly. := 
     rowSums(sapply(.SD, cummax)), .SDcols = vals, by = Name
   ][,UniqueFunds.SaleOnly. := cummax(UniqueFunds.SaleOnly.), by = Name
     ]

#Cleanup vals
dt[,vals := NULL, with = FALSE]
like image 2
NGaffney Avatar answered Nov 03 '22 16:11

NGaffney