R: Creating a unique ID in a data.table with a condition

Question

MyData:

DTX<-data.table(
  CX1 = c(NA,NA,NA, "Account1", "Account2", NA, "Account3", "Account1", "Account2","Account1","Account1","Account1"),
  CX2 = c(NA,"Account1B",NA,NA,NA,NA,NA,NA,NA,"Account1B","Account1B","Account1B"),
  CX1I = c(0,0,0, 1, 1, 0, 1, 1, 1,1,1,1),
  CX2I =  c(0,1,0, 0, 0, 0, 0, 0, 0,1,1,1))
DTX;

Requirement:

Create a unique ID based on column CX1 such that, if it is NA, assign a unique count; otherwise, if it has a value, assign the same count for each of its occurrences.

userinput<-c("CX1")

This should be dynamic for any column provided as input by the user. It can be one or many columns, so a generalization is required.

userinput<-c("CX1","CX2")

Please I prefer only data.table.
This will be used on a very large dataset so it must be fast and efficient.
DTX is an input from another source so all the blanks are converted to NA when imported.
CX1I and CX2I are identifying the respective columns as integers, indicating whether they have a value or not. They can be used or ignored, depending on the implementation in the required code(better).

My Solution(wrong):

userinput<-c("CX1")
DTX[, UniqueID_CX1 := ifelse(CX1I == 0,  .I, .GRP), by = .(get(userinput))]

Ideal solution would be: enter image description here

Edit : Better reproducible data.

library(data.table)

rowcount<-9e6;
set.seed(1)
DTX <- data.table(
    CX1  = sample(c(NA, "Account1", "Account2", "Account3"), rowcount, replace = TRUE, prob = c(0.9, 0.02, 0.05, 0.3)),
    CX2  = sample(c(NA, "Account1B", "Account2B", "Account3B"), rowcount, replace = TRUE, prob = c(0.95, 0.01, 0.02, 0.02))
  )
DTX

r2evans · Accepted Answer

I'll demonstrate using a list for userinputs to show that it works with 1 or more variables. For your use, if it will always be exactly one of them (just one vector of 1+ names), feel free to drop the for-loop, that's not a requirement for the inner code to work.

DTX <- data.table::as.data.table(structure(list(CX1 = c(NA, NA, NA, "Account1", "Account2", NA, "Account3", "Account1", "Account2", "Account1", "Account1", "Account1"), CX2 = c(NA, "Account1B", NA, NA, NA, NA, NA, NA, NA, "Account1B", "Account1B", "Account1B"), CX1I = c(0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1), CX2I = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1)), row.names = c(NA, -12L), class = c("data.table", "data.frame")))

userinputs <- list("CX1", "CX2", c("CX1", "CX2"))
for (userinput in userinputs) {

  # if you're doing this just once with one `userinput`,
  # use just this inner code block
  nm <- paste(c("UniqueID", userinput), collapse = "_")
  newdt <- unique(DTX[, .SD, .SDcols = userinput])
  newdt[, c("x", "allna") := .(.I, rowSums(!is.na(newdt)) == 0)]
  DTX[newdt, c(".x", ".allna") := .(as.numeric(i.x), i.allna), on = userinput] |>
    _[(.allna), .x := .x + (seq_len(.N)-1)/.N, by = userinput] |>
    _[, c(".x", ".allna") := .(frank(.x, ties.method = "dense", na.last = FALSE), NULL) ] |>
    setnames(".x", nm)

}

DTX
#          CX1       CX2  CX1I  CX2I UniqueID_CX1 UniqueID_CX2 UniqueID_CX1_CX2
#       <char>    <char> <num> <num>        <int>        <int>            <int>
#  1:     <NA>      <NA>     0     0            1            1                1
#  2:     <NA> Account1B     0     1            2            9                4
#  3:     <NA>      <NA>     0     0            3            2                2
#  4: Account1      <NA>     1     0            5            3                5
#  5: Account2      <NA>     1     0            6            4                6
#  6:     <NA>      <NA>     0     0            4            5                3
#  7: Account3      <NA>     1     0            7            6                7
#  8: Account1      <NA>     1     0            5            7                5
#  9: Account2      <NA>     1     0            6            8                6
# 10: Account1 Account1B     1     1            5            9                8
# 11: Account1 Account1B     1     1            5            9                8
# 12: Account1 Account1B     1     1            5            9                8

For simplicity (and technically "speed"), I make no attempt to order the IDs for the same results you show in your sample. I think this should be fine, since you're looking for uniqueness. If order is important, it should be addressed on the newdt frame before joining back into DTX.

I used temp variables .x and .allna to make this agnostic to the version of data.table, assuming that they would not exist in your real data. If you are concerned about collisions, then you can leverage the env= programmatic interface introduced in data.table_1.15.0 for both tempvars after coming up with names that are confirmed not in the original dataset. (I'll leave that as an exercise if it is important.)

As you suggested, an appealing alternative may be:

DTX[, (paste(c("UniqueID", userinput), collapse = "_")) := if (all(is.na(.BY))) .I else .GRP, by = userinput]

Unfortunately, that does not guarantee exclusive sets for .I and .GRP. As a simple example, let's change row 2 to be all-NA:

userinput <- c("CX1", "CX2")
### original DTX
DTX[2, CX2 := NA]

(allna <- which(rowSums(!is.na(DTX[, 1:2])) == 0))
# [1] 1 2 3 6

### confirmation of .I
DTX[, if (is.na(CX1) && is.na(CX2)) .I, by = userinput]
#       CX1    CX2    V1
#    <char> <char> <int>
# 1:   <NA>   <NA>     1
# 2:   <NA>   <NA>     2
# 3:   <NA>   <NA>     3
# 4:   <NA>   <NA>     6

### we should not see any `TRUE` rows with non-NA CX1/CX2
DTX[, any(allna %in% .GRP), by = userinput]
#         CX1       CX2     V1
#      <char>    <char> <lgcl>
# 1:     <NA>      <NA>   TRUE
# 2: Account1      <NA>   TRUE
# 3: Account2      <NA>   TRUE
# 4: Account3      <NA>  FALSE
# 5: Account1 Account1B  FALSE

This means when we're grouping by both (in this case), then at least one of the .GRP values is one of the values found in allna.

Empirically,

DTX[, (paste(c("UniqueID", userinput), collapse = "_")) := if (all(is.na(.BY))) .I else .GRP, by = userinput]
#          CX1       CX2  CX1I  CX2I UniqueID_CX1_CX2
#       <char>    <char> <num> <num>            <int>
#  1:     <NA>      <NA>     0     0                1
#  2:     <NA>      <NA>     0     1                2
#  3:     <NA>      <NA>     0     0                3
#  4: Account1      <NA>     1     0                2
#  5: Account2      <NA>     1     0                3
#  6:     <NA>      <NA>     0     0                6
#  7: Account3      <NA>     1     0                4
#  8: Account1      <NA>     1     0                2
#  9: Account2      <NA>     1     0                3
# 10: Account1 Account1B     1     1                5
# 11: Account1 Account1B     1     1                5
# 12: Account1 Account1B     1     1                5

Where the UniqueID_CX1_CX2 rows 2, 4, and 8 share the same "UniqueID" 2 but have differing CX1/CX2. This also fails with (say) userinput <- "CX1" (I'll leave it to you to verify this, no need to dump it here).

Going back to my first code, though,

userinput <- c("CX1", "CX2")
### original DTX
DTX[2, CX2 := NA]

  nm <- paste(c("UniqueID", userinput), collapse = "_")
  newdt <- unique(DTX[, .SD, .SDcols = userinput])
  newdt[, c("x", "allna") := .(.I, rowSums(!is.na(newdt)) == 0)]
  DTX[newdt, c(".x", ".allna") := .(as.numeric(i.x), i.allna), on = userinput] |>
    _[(.allna), .x := .x + (seq_len(.N)-1)/.N, by = userinput] |>
    _[, c(".x", ".allna") := .(frank(.x, ties.method = "dense", na.last = FALSE), NULL) ] |>
    setnames(".x", nm)

DTX
#          CX1       CX2  CX1I  CX2I UniqueID_CX1_CX2
#       <char>    <char> <num> <num>            <int>
#  1:     <NA>      <NA>     0     0                1
#  2:     <NA>      <NA>     0     1                2
#  3:     <NA>      <NA>     0     0                3
#  4: Account1      <NA>     1     0                5
#  5: Account2      <NA>     1     0                6
#  6:     <NA>      <NA>     0     0                4
#  7: Account3      <NA>     1     0                7
#  8: Account1      <NA>     1     0                5
#  9: Account2      <NA>     1     0                6
# 10: Account1 Account1B     1     1                8
# 11: Account1 Account1B     1     1                8
# 12: Account1 Account1B     1     1                8

R: Creating a unique ID in a data.table with a condition

Tags:

r

data.table

Mohit

1 Answers

r2evans

Recent Activity

Donate For Us

R: Creating a unique ID in a data.table with a condition

Tags:

r

data.table

Mohit

1 Answers

r2evans

Related questions

Recent Activity

Donate For Us