Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Creating a unique ID in a data.table with a condition

Tags:

r

data.table

MyData:

DTX<-data.table(
  CX1 = c(NA,NA,NA, "Account1", "Account2", NA, "Account3", "Account1", "Account2","Account1","Account1","Account1"),
  CX2 = c(NA,"Account1B",NA,NA,NA,NA,NA,NA,NA,"Account1B","Account1B","Account1B"),
  CX1I = c(0,0,0, 1, 1, 0, 1, 1, 1,1,1,1),
  CX2I =  c(0,1,0, 0, 0, 0, 0, 0, 0,1,1,1))
DTX;

Requirement:

  1. Create a unique ID based on column CX1 such that, if it is NA, assign a unique count; otherwise, if it has a value, assign the same count for each of its occurrences.
userinput<-c("CX1")
  1. This should be dynamic for any column provided as input by the user. It can be one or many columns, so a generalization is required.
userinput<-c("CX1","CX2")
  1. Please I prefer only data.table.
  2. This will be used on a very large dataset so it must be fast and efficient.
  3. DTX is an input from another source so all the blanks are converted to NA when imported.
  4. CX1I and CX2I are identifying the respective columns as integers, indicating whether they have a value or not. They can be used or ignored, depending on the implementation in the required code(better).

My Solution(wrong):

userinput<-c("CX1")
DTX[, UniqueID_CX1 := ifelse(CX1I == 0,  .I, .GRP), by = .(get(userinput))]

Ideal solution would be: enter image description here


Edit : Better reproducible data.

library(data.table)

rowcount<-9e6;
set.seed(1)
DTX <- data.table(
    CX1  = sample(c(NA, "Account1", "Account2", "Account3"), rowcount, replace = TRUE, prob = c(0.9, 0.02, 0.05, 0.3)),
    CX2  = sample(c(NA, "Account1B", "Account2B", "Account3B"), rowcount, replace = TRUE, prob = c(0.95, 0.01, 0.02, 0.02))
  )
DTX
like image 961
Mohit Avatar asked Oct 20 '25 13:10

Mohit


1 Answers

I'll demonstrate using a list for userinputs to show that it works with 1 or more variables. For your use, if it will always be exactly one of them (just one vector of 1+ names), feel free to drop the for-loop, that's not a requirement for the inner code to work.

DTX <- data.table::as.data.table(structure(list(CX1 = c(NA, NA, NA, "Account1", "Account2", NA, "Account3", "Account1", "Account2", "Account1", "Account1", "Account1"), CX2 = c(NA, "Account1B", NA, NA, NA, NA, NA, NA, NA, "Account1B", "Account1B", "Account1B"), CX1I = c(0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1), CX2I = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1)), row.names = c(NA, -12L), class = c("data.table", "data.frame")))

userinputs <- list("CX1", "CX2", c("CX1", "CX2"))
for (userinput in userinputs) {

  # if you're doing this just once with one `userinput`,
  # use just this inner code block
  nm <- paste(c("UniqueID", userinput), collapse = "_")
  newdt <- unique(DTX[, .SD, .SDcols = userinput])
  newdt[, c("x", "allna") := .(.I, rowSums(!is.na(newdt)) == 0)]
  DTX[newdt, c(".x", ".allna") := .(as.numeric(i.x), i.allna), on = userinput] |>
    _[(.allna), .x := .x + (seq_len(.N)-1)/.N, by = userinput] |>
    _[, c(".x", ".allna") := .(frank(.x, ties.method = "dense", na.last = FALSE), NULL) ] |>
    setnames(".x", nm)

}

DTX
#          CX1       CX2  CX1I  CX2I UniqueID_CX1 UniqueID_CX2 UniqueID_CX1_CX2
#       <char>    <char> <num> <num>        <int>        <int>            <int>
#  1:     <NA>      <NA>     0     0            1            1                1
#  2:     <NA> Account1B     0     1            2            9                4
#  3:     <NA>      <NA>     0     0            3            2                2
#  4: Account1      <NA>     1     0            5            3                5
#  5: Account2      <NA>     1     0            6            4                6
#  6:     <NA>      <NA>     0     0            4            5                3
#  7: Account3      <NA>     1     0            7            6                7
#  8: Account1      <NA>     1     0            5            7                5
#  9: Account2      <NA>     1     0            6            8                6
# 10: Account1 Account1B     1     1            5            9                8
# 11: Account1 Account1B     1     1            5            9                8
# 12: Account1 Account1B     1     1            5            9                8

For simplicity (and technically "speed"), I make no attempt to order the IDs for the same results you show in your sample. I think this should be fine, since you're looking for uniqueness. If order is important, it should be addressed on the newdt frame before joining back into DTX.

I used temp variables .x and .allna to make this agnostic to the version of data.table, assuming that they would not exist in your real data. If you are concerned about collisions, then you can leverage the env= programmatic interface introduced in data.table_1.15.0 for both tempvars after coming up with names that are confirmed not in the original dataset. (I'll leave that as an exercise if it is important.)


As you suggested, an appealing alternative may be:

DTX[, (paste(c("UniqueID", userinput), collapse = "_")) := if (all(is.na(.BY))) .I else .GRP, by = userinput]

Unfortunately, that does not guarantee exclusive sets for .I and .GRP. As a simple example, let's change row 2 to be all-NA:

userinput <- c("CX1", "CX2")
### original DTX
DTX[2, CX2 := NA]

(allna <- which(rowSums(!is.na(DTX[, 1:2])) == 0))
# [1] 1 2 3 6

### confirmation of .I
DTX[, if (is.na(CX1) && is.na(CX2)) .I, by = userinput]
#       CX1    CX2    V1
#    <char> <char> <int>
# 1:   <NA>   <NA>     1
# 2:   <NA>   <NA>     2
# 3:   <NA>   <NA>     3
# 4:   <NA>   <NA>     6

### we should not see any `TRUE` rows with non-NA CX1/CX2
DTX[, any(allna %in% .GRP), by = userinput]
#         CX1       CX2     V1
#      <char>    <char> <lgcl>
# 1:     <NA>      <NA>   TRUE
# 2: Account1      <NA>   TRUE
# 3: Account2      <NA>   TRUE
# 4: Account3      <NA>  FALSE
# 5: Account1 Account1B  FALSE

This means when we're grouping by both (in this case), then at least one of the .GRP values is one of the values found in allna.

Empirically,

DTX[, (paste(c("UniqueID", userinput), collapse = "_")) := if (all(is.na(.BY))) .I else .GRP, by = userinput]
#          CX1       CX2  CX1I  CX2I UniqueID_CX1_CX2
#       <char>    <char> <num> <num>            <int>
#  1:     <NA>      <NA>     0     0                1
#  2:     <NA>      <NA>     0     1                2
#  3:     <NA>      <NA>     0     0                3
#  4: Account1      <NA>     1     0                2
#  5: Account2      <NA>     1     0                3
#  6:     <NA>      <NA>     0     0                6
#  7: Account3      <NA>     1     0                4
#  8: Account1      <NA>     1     0                2
#  9: Account2      <NA>     1     0                3
# 10: Account1 Account1B     1     1                5
# 11: Account1 Account1B     1     1                5
# 12: Account1 Account1B     1     1                5

Where the UniqueID_CX1_CX2 rows 2, 4, and 8 share the same "UniqueID" 2 but have differing CX1/CX2. This also fails with (say) userinput <- "CX1" (I'll leave it to you to verify this, no need to dump it here).

Going back to my first code, though,

userinput <- c("CX1", "CX2")
### original DTX
DTX[2, CX2 := NA]

  nm <- paste(c("UniqueID", userinput), collapse = "_")
  newdt <- unique(DTX[, .SD, .SDcols = userinput])
  newdt[, c("x", "allna") := .(.I, rowSums(!is.na(newdt)) == 0)]
  DTX[newdt, c(".x", ".allna") := .(as.numeric(i.x), i.allna), on = userinput] |>
    _[(.allna), .x := .x + (seq_len(.N)-1)/.N, by = userinput] |>
    _[, c(".x", ".allna") := .(frank(.x, ties.method = "dense", na.last = FALSE), NULL) ] |>
    setnames(".x", nm)

DTX
#          CX1       CX2  CX1I  CX2I UniqueID_CX1_CX2
#       <char>    <char> <num> <num>            <int>
#  1:     <NA>      <NA>     0     0                1
#  2:     <NA>      <NA>     0     1                2
#  3:     <NA>      <NA>     0     0                3
#  4: Account1      <NA>     1     0                5
#  5: Account2      <NA>     1     0                6
#  6:     <NA>      <NA>     0     0                4
#  7: Account3      <NA>     1     0                7
#  8: Account1      <NA>     1     0                5
#  9: Account2      <NA>     1     0                6
# 10: Account1 Account1B     1     1                8
# 11: Account1 Account1B     1     1                8
# 12: Account1 Account1B     1     1                8
like image 190
r2evans Avatar answered Oct 22 '25 03:10

r2evans



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!