MyData:
DTX<-data.table(
CX1 = c(NA,NA,NA, "Account1", "Account2", NA, "Account3", "Account1", "Account2","Account1","Account1","Account1"),
CX2 = c(NA,"Account1B",NA,NA,NA,NA,NA,NA,NA,"Account1B","Account1B","Account1B"),
CX1I = c(0,0,0, 1, 1, 0, 1, 1, 1,1,1,1),
CX2I = c(0,1,0, 0, 0, 0, 0, 0, 0,1,1,1))
DTX;
Requirement:
CX1
such that, if it is NA, assign a unique count; otherwise, if it has a value, assign the same count for each of its occurrences.userinput<-c("CX1")
userinput<-c("CX1","CX2")
DTX
is an input from another source so all the blanks are converted to NA when imported.CX1I
and CX2I
are identifying the respective columns as integers, indicating whether they have a value or not. They can be used or ignored, depending on the implementation in the required code(better).My Solution(wrong):
userinput<-c("CX1")
DTX[, UniqueID_CX1 := ifelse(CX1I == 0, .I, .GRP), by = .(get(userinput))]
Ideal solution would be:
Edit : Better reproducible data.
library(data.table)
rowcount<-9e6;
set.seed(1)
DTX <- data.table(
CX1 = sample(c(NA, "Account1", "Account2", "Account3"), rowcount, replace = TRUE, prob = c(0.9, 0.02, 0.05, 0.3)),
CX2 = sample(c(NA, "Account1B", "Account2B", "Account3B"), rowcount, replace = TRUE, prob = c(0.95, 0.01, 0.02, 0.02))
)
DTX
I'll demonstrate using a list for userinputs
to show that it works with 1 or more variables. For your use, if it will always be exactly one of them (just one vector of 1+ names), feel free to drop the for
-loop, that's not a requirement for the inner code to work.
DTX <- data.table::as.data.table(structure(list(CX1 = c(NA, NA, NA, "Account1", "Account2", NA, "Account3", "Account1", "Account2", "Account1", "Account1", "Account1"), CX2 = c(NA, "Account1B", NA, NA, NA, NA, NA, NA, NA, "Account1B", "Account1B", "Account1B"), CX1I = c(0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1), CX2I = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1)), row.names = c(NA, -12L), class = c("data.table", "data.frame")))
userinputs <- list("CX1", "CX2", c("CX1", "CX2"))
for (userinput in userinputs) {
# if you're doing this just once with one `userinput`,
# use just this inner code block
nm <- paste(c("UniqueID", userinput), collapse = "_")
newdt <- unique(DTX[, .SD, .SDcols = userinput])
newdt[, c("x", "allna") := .(.I, rowSums(!is.na(newdt)) == 0)]
DTX[newdt, c(".x", ".allna") := .(as.numeric(i.x), i.allna), on = userinput] |>
_[(.allna), .x := .x + (seq_len(.N)-1)/.N, by = userinput] |>
_[, c(".x", ".allna") := .(frank(.x, ties.method = "dense", na.last = FALSE), NULL) ] |>
setnames(".x", nm)
}
DTX
# CX1 CX2 CX1I CX2I UniqueID_CX1 UniqueID_CX2 UniqueID_CX1_CX2
# <char> <char> <num> <num> <int> <int> <int>
# 1: <NA> <NA> 0 0 1 1 1
# 2: <NA> Account1B 0 1 2 9 4
# 3: <NA> <NA> 0 0 3 2 2
# 4: Account1 <NA> 1 0 5 3 5
# 5: Account2 <NA> 1 0 6 4 6
# 6: <NA> <NA> 0 0 4 5 3
# 7: Account3 <NA> 1 0 7 6 7
# 8: Account1 <NA> 1 0 5 7 5
# 9: Account2 <NA> 1 0 6 8 6
# 10: Account1 Account1B 1 1 5 9 8
# 11: Account1 Account1B 1 1 5 9 8
# 12: Account1 Account1B 1 1 5 9 8
For simplicity (and technically "speed"), I make no attempt to order the IDs for the same results you show in your sample. I think this should be fine, since you're looking for uniqueness. If order is important, it should be addressed on the newdt
frame before joining back into DTX
.
I used temp variables .x
and .allna
to make this agnostic to the version of data.table
, assuming that they would not exist in your real data. If you are concerned about collisions, then you can leverage the env=
programmatic interface introduced in data.table_1.15.0
for both tempvars after coming up with names that are confirmed not in the original dataset. (I'll leave that as an exercise if it is important.)
As you suggested, an appealing alternative may be:
DTX[, (paste(c("UniqueID", userinput), collapse = "_")) := if (all(is.na(.BY))) .I else .GRP, by = userinput]
Unfortunately, that does not guarantee exclusive sets for .I
and .GRP
. As a simple example, let's change row 2 to be all-NA
:
userinput <- c("CX1", "CX2")
### original DTX
DTX[2, CX2 := NA]
(allna <- which(rowSums(!is.na(DTX[, 1:2])) == 0))
# [1] 1 2 3 6
### confirmation of .I
DTX[, if (is.na(CX1) && is.na(CX2)) .I, by = userinput]
# CX1 CX2 V1
# <char> <char> <int>
# 1: <NA> <NA> 1
# 2: <NA> <NA> 2
# 3: <NA> <NA> 3
# 4: <NA> <NA> 6
### we should not see any `TRUE` rows with non-NA CX1/CX2
DTX[, any(allna %in% .GRP), by = userinput]
# CX1 CX2 V1
# <char> <char> <lgcl>
# 1: <NA> <NA> TRUE
# 2: Account1 <NA> TRUE
# 3: Account2 <NA> TRUE
# 4: Account3 <NA> FALSE
# 5: Account1 Account1B FALSE
This means when we're grouping by both (in this case), then at least one of the .GRP
values is one of the values found in allna
.
Empirically,
DTX[, (paste(c("UniqueID", userinput), collapse = "_")) := if (all(is.na(.BY))) .I else .GRP, by = userinput]
# CX1 CX2 CX1I CX2I UniqueID_CX1_CX2
# <char> <char> <num> <num> <int>
# 1: <NA> <NA> 0 0 1
# 2: <NA> <NA> 0 1 2
# 3: <NA> <NA> 0 0 3
# 4: Account1 <NA> 1 0 2
# 5: Account2 <NA> 1 0 3
# 6: <NA> <NA> 0 0 6
# 7: Account3 <NA> 1 0 4
# 8: Account1 <NA> 1 0 2
# 9: Account2 <NA> 1 0 3
# 10: Account1 Account1B 1 1 5
# 11: Account1 Account1B 1 1 5
# 12: Account1 Account1B 1 1 5
Where the UniqueID_CX1_CX2
rows 2, 4, and 8 share the same "UniqueID" 2
but have differing CX1/CX2. This also fails with (say) userinput <- "CX1"
(I'll leave it to you to verify this, no need to dump it here).
Going back to my first code, though,
userinput <- c("CX1", "CX2")
### original DTX
DTX[2, CX2 := NA]
nm <- paste(c("UniqueID", userinput), collapse = "_")
newdt <- unique(DTX[, .SD, .SDcols = userinput])
newdt[, c("x", "allna") := .(.I, rowSums(!is.na(newdt)) == 0)]
DTX[newdt, c(".x", ".allna") := .(as.numeric(i.x), i.allna), on = userinput] |>
_[(.allna), .x := .x + (seq_len(.N)-1)/.N, by = userinput] |>
_[, c(".x", ".allna") := .(frank(.x, ties.method = "dense", na.last = FALSE), NULL) ] |>
setnames(".x", nm)
DTX
# CX1 CX2 CX1I CX2I UniqueID_CX1_CX2
# <char> <char> <num> <num> <int>
# 1: <NA> <NA> 0 0 1
# 2: <NA> <NA> 0 1 2
# 3: <NA> <NA> 0 0 3
# 4: Account1 <NA> 1 0 5
# 5: Account2 <NA> 1 0 6
# 6: <NA> <NA> 0 0 4
# 7: Account3 <NA> 1 0 7
# 8: Account1 <NA> 1 0 5
# 9: Account2 <NA> 1 0 6
# 10: Account1 Account1B 1 1 8
# 11: Account1 Account1B 1 1 8
# 12: Account1 Account1B 1 1 8
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With