Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Assign unique ID per multiple columns of data table

I would like to assign unique IDs to rows of a data table per multiple column values. Let's consider a simple example:

library(data.table)       
DT = data.table(a=c(4,2,NA,2,NA), b=c("a","b","c","b","c"), c=1:5)

    a b c
1:  4 a 1
2:  2 b 2
3: NA c 3
4:  2 b 4
5: NA c 5

I'd like to generate IDs based on columns a and b and expect to get three IDs where 2nd and 4th row IDs are identical and 3rd and 5th rows have the same ID as well.

I have seen two solutions but each are slightly incomplete:

1) Solution one requires sorted data table which is very cumbersome if we need to generate IDs per many columns (in my real application, IDs are generated based on about ten columns). Can we replace cumsum function so no sorting is required?

DT$ID1 <- cumsum(!duplicated(DT[,1:2]))

2) Solution two ignores NA values; while I'd like to include NAs and assign a group ID to them

DT <- transform(DT, ID2 = as.numeric(interaction(a,b, drop=TRUE)))

I appreciate any suggestion on how to modify either of the solutions to generate the Expected_ID shown below.

    a b c ID1 ID2 Expected_ID
1:  4 a 1   1   1           1
2:  2 b 2   2   2           2
3: NA c 3   3  NA           3
4:  2 b 4   3   2           2
5: NA c 5   3  NA           3
like image 926
Mahmoud Avatar asked Oct 04 '18 17:10

Mahmoud


1 Answers

The idiomatic way:

DT[, g := .GRP, by=.(a,b)]

    a b c g
1:  4 a 1 1
2:  2 b 2 2
3: NA c 3 3
4:  2 b 4 2
5: NA c 5 3

There's reason to believe this won't be fast, but it turns out it's not too bad compared to competing approaches:

nv = 10
nu = 3
nr = 1e6

library(data.table)
set.seed(1)
DT = do.call(CJ, rep(list(seq_len(nu)), nv))[sample(1:.N, nr, replace=TRUE)]

cols = copy(names(DT))

# "idiomatic" .GRP
system.time(DT[, g := .GRP, by=cols])
#    user  system elapsed 
#    0.23    0.02    0.25 

# sort and count runs
oi = as.call(lapply(c("order", cols), as.name))
system.time(DT[eval(oi), go := rleidv(.SD, cols)])
#    user  system elapsed 
#     0.3     0.0     0.3

# paste 'em
system.time(DT[, gp := match(p <- do.call(paste, c(.SD, list(sep="_"))), unique(p)), .SDcols=cols])
#    user  system elapsed 
#    5.26    0.06    5.32 

# paste 'em, fact'em (@akrun's answer)
system.time(DT[, gpf := as.integer(factor(p <- do.call(paste, c(.SD, list(sep="_"))), levels = unique(p))), .SDcols=cols])
#    user  system elapsed 
#    4.74    0.08    4.82 

# check
identical(DT$g, DT$gp); identical(DT$g, DT$gpf)
uniqueN(DT, "g") == uniqueN(DT, c("g", "go"))

The rleidv way creates different group numbers, but effects the same grouping.

Increasing the size of the problem to nr = 5e7 raised the time to 8s for the .GRP approach; 20s for the rleidv way; and led R to hang for the others on my system.

For anyone interested, more approaches can be found in the R FAQ How to create a consecutive index based on a grouping variable in a dataframe

like image 144
Frank Avatar answered Sep 30 '22 07:09

Frank