I have a data.table of factor columns, and I want to pull out the label of the last non-missing value in each row. It's kind of a typical max.col situation, but I don't want to coerce needlessly, as I am trying to optimize this code using data.table. The real data has other types of columns as well.
Here is the example:
library(data.table)
## Some sample data
set.seed(0)
dat <- sapply(split(letters[1:25], rep.int(1:5, 5)), sample, size=8, replace=TRUE)
dat[upper.tri(dat)] <- NA
dat[4:5, 4:5] <- NA # the real data isn't nice and upper.triangular
dat <- data.frame(dat, stringsAsFactors = TRUE) # factor columns
## So, it looks like this
setDT(dat)[]
# X1 X2 X3 X4 X5
# 1: u NA NA NA NA
# 2: f q NA NA NA
# 3: f b w NA NA
# 4: k g h NA NA
# 5: u b r NA NA
# 6: f q w x t
# 7: u g h i e
# 8: u q r n t
## I just want to get the labels of the factors
## that are 'rightmost' in each row. I tried a number of things
## that probably don't make sense here.
## This just about gets the column index
dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)]
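Grouping with by=1:nrow(dat) evaluates sum() once per row. The same count can be computed in one vectorised call. Below is a minimal sketch on a small hand-made data.frame; the name `toy` and its values are made up for illustration and are not the question's `dat`. Note that the count equals the index of the last filled column only when the NAs in a row are all trailing.

```r
# Hypothetical toy data with trailing NAs, standing in for `dat`
toy <- data.frame(
  X1 = factor(c("u", "f", "k")),
  X2 = factor(c(NA,  "q", "g")),
  X3 = factor(c(NA,  NA,  "h"))
)

# is.na() on a data.frame returns a logical matrix, so rowSums()
# counts the non-missing cells of every row in a single call
n_filled <- rowSums(!is.na(toy))
n_filled
```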
This is the goal, though: to extract these labels, shown here using regular base functions.
## Using max.col and a data.frame
df1 <- as.data.frame(dat)
inds <- max.col(is.na(as.matrix(df1)), ties.method = "first") - 1
inds[inds==0] <- ncol(df1)
df1[cbind(1:nrow(df1), inds)]
# [1] "u" "q" "w" "h" "r" "t" "e" "t"
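The max.col code above finds the first NA in each row and steps one column back, so it relies on the NAs being trailing. A variant that looks for the last non-missing cell directly might be sketched as follows, assuming a character matrix; the matrix `m` is hypothetical and deliberately has an NA in the middle of its third row.

```r
# Hypothetical character matrix; row 3 has an NA in the middle
m <- matrix(c("a", NA,  NA,
              "b", "c", NA,
              "d", NA,  "f"),
            nrow = 3, byrow = TRUE)

# TRUE marks non-missing cells; with ties.method = "last", max.col()
# returns the position of the rightmost TRUE in each row
last_col <- max.col(!is.na(m), ties.method = "last")

# Indexing with a two-column (row, column) matrix extracts one cell per row
res <- m[cbind(seq_len(nrow(m)), last_col)]
res
```

For a row that is entirely NA, every cell ties, so the last column is selected and the extracted value is NA.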
Here's another way:
dat[, res := NA_character_]
for (v in rev(names(dat))[-1]) dat[is.na(res), res := get(v)]
X1 X2 X3 X4 X5 res
1: u NA NA NA NA u
2: f q NA NA NA q
3: f b w NA NA w
4: k g h NA NA h
5: u b r NA NA r
6: f q w x t t
7: u g h i e e
8: u q r n t t
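The loop above is in effect a right-to-left coalesce. Assuming a data.table version that ships fcoalesce(), the same result might be obtained as in the sketch below. This is a swapped-in technique, not the code benchmarked in this answer, and the table `toy` is hypothetical. The columns are converted to character first because the factor columns do not share levels.

```r
library(data.table)

# Hypothetical toy table with factor columns, standing in for `dat`
toy <- data.table(
  X1 = factor(c("u", "f", "k")),
  X2 = factor(c(NA,  "q", "g")),
  X3 = factor(c(NA,  NA,  "h"))
)

# Convert to character, since the factor columns have different levels
cols <- lapply(toy, as.character)

# Reverse the columns so the rightmost one is tried first;
# fcoalesce() keeps the first non-NA value across its arguments
res <- do.call(fcoalesce, unname(rev(cols)))
res
```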
Benchmarks
Using the same data as @alexis_laz and making (apparently) superficial changes to the functions, I see different results. I'm just showing them here in case anyone is curious. Alexis' answer (with small modifications) still comes out ahead.
Functions:
alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]]))){
  if(!length(wh)) return(ans)                      # no unresolved rows left
  ans[wh] = as.character(x[[length(x)]])[wh]       # fill from the last remaining column
  Recall(x[-length(x)], ans, wh[is.na(ans[wh])])   # recurse on rows that are still NA
}
alex2 = function(x){
  x[, res := NA_character_]
  wh = x[, .I]                                     # row indices still to fill
  for (v in (length(x)-1):1){                      # right to left, skipping 'res'
    if (!length(wh)) break
    set(x, j="res", i=wh, value = x[[v]][wh])
    wh = wh[is.na(x$res[wh])]
  }
  x$res
}
frank = function(x){
  x[, res := NA_character_]
  for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
  return(x$res)
}
frank2 = function(x){
  x[, res := NA_character_]
  for(v in rev(names(x))[-1]) x[is.na(res), res := .SD, .SDcols=v]
  x$res
}
Example data and benchmark:
DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)),
function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
DAT3 = as.list(copy(DAT1))
DAT4 = copy(DAT1)
library(microbenchmark)
microbenchmark(frank(DAT1), frank2(DAT2), alex(DAT3), alex2(DAT4), times = 30)
Unit: milliseconds
expr min lq mean median uq max neval
frank(DAT1) 850.05980 909.28314 985.71700 979.84230 1023.57049 1183.37898 30
frank2(DAT2) 88.68229 93.40476 118.27959 107.69190 121.60257 346.48264 30
alex(DAT3) 98.56861 109.36653 131.21195 131.20760 149.99347 183.43918 30
alex2(DAT4) 26.14104 26.45840 30.79294 26.67951 31.24136 50.66723 30
Another idea, similar to Frank's, that tries (1) to avoid subsetting 'data.table' rows (which I assume must have some cost) and (2) to avoid checking a length == nrow(dat) vector for NAs in every iteration.
alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]])))
{
    if(!length(wh)) return(ans)
    ans[wh] = as.character(x[[length(x)]])[wh]
    Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}
alex(as.list(dat)) #had some trouble with 'data.table' subsetting
# [1] "u" "q" "w" "h" "r" "t" "e" "t"
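Recall() adds one call frame per column, so the same idea can also be written as a plain loop. The following is a sketch only; the helper name `last_non_na` and the list `toy` are hypothetical and not part of the benchmarked code.

```r
# Hypothetical helper: iterative form of the right-to-left fill
last_non_na <- function(cols) {
  n   <- length(cols[[1L]])
  ans <- rep_len(NA_character_, n)
  wh  <- seq_len(n)                      # rows that are still unresolved
  for (j in rev(seq_along(cols))) {      # walk the columns right to left
    if (!length(wh)) break               # every row already has a value
    ans[wh] <- as.character(cols[[j]])[wh]
    wh <- wh[is.na(ans[wh])]             # keep only the rows that are still NA
  }
  ans
}

# Hypothetical list of factor columns with trailing NAs
toy <- list(X1 = factor(c("u", "f", "k")),
            X2 = factor(c(NA,  "q", "g")),
            X3 = factor(c(NA,  NA,  "h")))
last_non_na(toy)
```

Unlike the recursive form, the loop simply stops when the columns run out, so a row that is entirely NA stays NA in the result.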
And to compare with Frank's:
frank = function(x)
{
    x[, res := NA_character_]
    for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
    return(x$res)
}
DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)),
function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
microbenchmark::microbenchmark(alex(as.list(DAT1)),
{ frank(DAT2); DAT2[, res := NULL] },
times = 30)
#Unit: milliseconds
# expr min lq median uq max neval
# alex(as.list(DAT1)) 102.9767 108.5134 117.6595 133.1849 166.9594 30
# { frank(DAT2) DAT2[, `:=`(res, NULL)] } 1413.3296 1455.1553 1497.3517 1540.8705 1685.0589 30
identical(alex(as.list(DAT1)), frank(DAT2))
#[1] TRUE
Here is a one-liner base R approach:
sapply(split(dat, seq(nrow(dat))), function(x) tail(x[!is.na(x)],1))
# 1 2 3 4 5 6 7 8
#"u" "q" "w" "h" "r" "t" "e" "t"
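sapply() falls back to a list when a row has no non-missing value at all, because tail() then returns a zero-length vector. A type-stable variant could be sketched with vapply(), assuming the input is a plain data.frame; the data.frame `toy` below is hypothetical.

```r
# Hypothetical data.frame of factor columns
toy <- data.frame(X1 = c("u", "f", "k"),
                  X2 = c(NA,  "q", "g"),
                  X3 = c(NA,  NA,  "h"),
                  stringsAsFactors = TRUE)

m <- as.matrix(toy)   # character matrix of the factor labels

# vapply() guarantees a character vector with one element per row
res <- vapply(seq_len(nrow(m)), function(i) {
  r <- m[i, ]
  r <- r[!is.na(r)]
  if (length(r)) r[[length(r)]] else NA_character_
}, character(1))
res
```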